CN117576689A - Image description text generation method and device, computer equipment and storage medium - Google Patents

Image description text generation method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN117576689A
Authority
CN
China
Prior art keywords
image
encoder
text
vector
description text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311487597.2A
Other languages
Chinese (zh)
Inventor
舒畅
肖京
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shanghai Co ltd
Original Assignee
Ping An Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shanghai Co ltd filed Critical Ping An Technology Shanghai Co ltd
Priority to CN202311487597.2A priority Critical patent/CN117576689A/en
Publication of CN117576689A publication Critical patent/CN117576689A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application relates to the fields of image processing and digital healthcare, and in particular discloses a method and apparatus for generating image description text, a computer device, and a storage medium. An object to be described is encoded by at least two preset image encoders to obtain image feature vectors; the image feature vectors are processed by an encoder module of a preset model to obtain target feature vectors; the target feature vectors are spliced to obtain a vector to be decoded; and the vector to be decoded is decoded by a decoder module to obtain the description text. By encoding the object to be described with several preset image encoders to obtain local image features, encoding the resulting local features with the encoder module of the preset model to obtain global feature vectors, and decoding with the decoder module to obtain the description text of the image, the method combines local and global features, balances the attention paid to each, and improves the accuracy of the image description text.

Description

Image description text generation method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of image processing and digital medical technology, and in particular, to a method and apparatus for generating an image description text, a computer device, and a storage medium.
Background
Image description has a wide range of application scenarios: text labels can be extracted from the descriptions of unlabeled pictures, and descriptive language can be generated for pictures and even videos. For example, in the medical field, a medical image is produced after a patient undergoes a health examination; extracting features from the medical image and generating the corresponding description text makes it possible to produce the patient's examination report directly, so the accuracy of the description text is important.
A conventional image description method usually adopts a simple encoder-decoder structure: an image encoder performs feature encoding on the picture, and a decoder performs text decoding on the encoded picture features to generate the corresponding words. More recently, image description text has been generated with a contrastive learning framework: after the image and text features are mapped into the same representation space, the description is generated by a decoder, and because features of different modalities share one space, performance improves considerably. However, conventional contrastive learning encodes the picture only globally and as a single whole, ignoring the local object features in the image. How to balance the attention paid to global and local features in an image, and thereby improve the accuracy of the picture description text, is a problem to be solved.
Disclosure of Invention
The application provides a method, a device, computer equipment and a storage medium for generating an image description text, so as to balance the attention of global and local features in an image and further improve the accuracy of the image description text.
In a first aspect, the present application provides a method for generating an image description text, where the method includes:
encoding an object to be described based on at least two preset image encoders respectively to obtain image feature vectors corresponding to the object to be described;
the encoder module based on a preset model respectively processes the image feature vectors to obtain target feature vectors corresponding to the image feature vectors;
splicing the target feature vectors to obtain vectors to be decoded;
and decoding the vector to be decoded based on a decoder module of the preset model to obtain a description text corresponding to the object to be described.
In a second aspect, the present application further provides an apparatus for generating an image description text, where the apparatus includes:
the image feature vector obtaining module is used for respectively encoding the object to be described based on at least two preset image encoders to obtain an image feature vector corresponding to the object to be described;
the target feature vector obtaining module is used for respectively processing the image feature vectors based on an encoder module of a preset model to obtain target feature vectors corresponding to the image feature vectors;
the vector to be decoded obtaining module is used for splicing the target feature vectors to obtain a vector to be decoded;
the description text obtaining module is used for decoding the vector to be decoded based on the decoder module of the preset model to obtain the description text corresponding to the object to be described.
In a third aspect, the present application also provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the method for generating the image description text as described above when the computer program is executed.
In a fourth aspect, the present application further provides a computer readable storage medium storing a computer program, which when executed by a processor causes the processor to implement a method for generating an image description text as described above.
The application discloses a method, a device, computer equipment and a storage medium for generating an image description text, which are used for respectively encoding an object to be described based on at least two preset image encoders to obtain an image feature vector corresponding to the object to be described; the encoder module based on a preset model respectively processes the image feature vectors to obtain target feature vectors corresponding to the image feature vectors; splicing the target feature vectors to obtain vectors to be decoded; and decoding the vector to be decoded based on a decoder module of the preset model to obtain a description text corresponding to the object to be described. According to the method, a plurality of preset image encoders are used for encoding an object to be described to obtain local image features, the encoder modules of the preset models are used for encoding the obtained local image features to obtain global target feature vectors, and the decoder modules are used for decoding to obtain description texts of the images.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a first embodiment of a method for generating image description text provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a feature encoding flow of a method for generating an image description text according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating a second embodiment of a method for generating image description text provided by an embodiment of the present application;
FIG. 4 is a flow chart illustrating a third embodiment of a method for generating image description text provided by embodiments of the present application;
FIG. 5 is a schematic block diagram of an apparatus for generating image description text according to an embodiment of the present application;
fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the application provides a method, a device, computer equipment and a storage medium for generating an image description text. The method for generating the image description text can be applied to a server, and the attention of global and local features in the image is balanced by combining the local features and the global features of the image, so that the accuracy of the image description text is improved. The server may be an independent server or a server cluster.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for generating an image description text according to an embodiment of the present application. The method for generating the image description text can be applied to a server and used for balancing the attention of global and local features in the image by combining the local features and the global features of the image, so that the accuracy of the image description text is improved.
As shown in fig. 1, the method for generating the image description text specifically includes steps S101 to S104.
S101, respectively encoding an object to be described based on at least two preset image encoders to obtain an image feature vector corresponding to the object to be described.
Based on at least two preset image encoders, encoding an object to be described respectively to obtain image feature vectors corresponding to the object to be described, wherein the method comprises the following steps: and respectively encoding the object to be described based on the first image encoder and the second image encoder to obtain a first image feature vector and a second image feature vector.
In one embodiment, the object to be described may be a picture or a video: for example, a medical image produced by a patient's health examination, or a poster, marketing picture or promotional picture in the financial field for which a corresponding description is generated automatically, or a training video for which training content is generated automatically.
In one embodiment, the image features of different feature emphasis points of the image to be described are obtained by combining a plurality of image encoders, and the prediction generation of the description text of the object to be described is completed by combining the image features of different emphasis points. For example, in a medical image picture generated by a patient examination, different image encoders may be focused on different body parts, or there may be different points of image focus, e.g., one image encoder may focus only on a bone fracture of a patient, and another image encoder may focus on whether there is a lesion in the bone of the patient.
In a specific embodiment, two (or more) image encoders are used, for example an object-detection-based encoder as the first image encoder and a grid-based encoder as the second image encoder. The same object to be described is encoded by both. Because their emphases differ (object detection emphasizes the features of the objects in the picture, while the grid encoder emphasizes the picture's overall distribution and the relations between regions), the resulting first image feature vector and second image feature vector differ from each other.
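The following is a minimal Python/PyTorch sketch of how two image encoders with different emphases could encode the same object to be described. The concrete encoder choices (a ResNet-50 backbone kept as a grid of cells for the grid-based encoder, and a coarse region-pooling module standing in for an object-detection-based encoder) are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch of step S101: the same object to be described is encoded by two
# image encoders with different emphases. Encoder choices are assumptions only.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class GridEncoder(nn.Module):
    """Grid-based encoder: keeps the spatial feature map as a sequence of grid cells."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc

    def forward(self, image):                       # image: (B, 3, H, W)
        fmap = self.features(image)                 # (B, 2048, h, w)
        return fmap.flatten(2).transpose(1, 2)      # (B, h*w, 2048) grid tokens

class RegionEncoder(nn.Module):
    """Stand-in for an object-detection-based encoder: pools a few coarse regions."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(3)         # 3x3 = 9 pseudo-regions

    def forward(self, image):
        regions = self.pool(self.features(image))   # (B, 2048, 3, 3)
        return regions.flatten(2).transpose(1, 2)    # (B, 9, 2048) region tokens

image = torch.randn(1, 3, 224, 224)                 # the object to be described
v1 = RegionEncoder()(image)                          # first image feature vector
v2 = GridEncoder()(image)                            # second image feature vector
```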
S102, respectively processing the image feature vectors by an encoder module based on a preset model to obtain target feature vectors corresponding to the image feature vectors.
The encoder module based on a preset model respectively processes the image feature vectors to obtain target feature vectors corresponding to the image feature vectors, and the encoder module comprises: based on the self-attention layer of the coding module, respectively carrying out self-attention weight calculation on the first image feature vector and the second image feature vector to obtain a first weight parameter of the image feature in the first image feature vector and a second weight parameter of the image feature in the second image feature vector; and processing the first image feature vector and the second image feature vector based on the first weight parameter and the second weight parameter respectively to obtain the first target vector and the second target vector.
In one embodiment, the preset model is chosen as required, for example a Transformer (a deep learning model based on the attention mechanism) is used as the preset model. The preset model includes an encoder module and a decoder module.
In one embodiment, the encoder and decoder modules of the pre-set model consist of N layers of encoders or decoders.
In one embodiment, feature fusion is performed over multiple image views. When two different image encoders are set, encoding the same object to be described yields two image features, a first image feature vector denoted v1 and a second image feature vector denoted v2, which represent the features encoded by the different image encoders.
In one embodiment, the first image feature vector is vector-stitched with the second image feature vector and then input to the encoder module of the preset model, which is a self-attention encoding layer. As shown in fig. 2, v1 and v2 are each treated as independent features in the self-attention calculation to obtain a weight parameter for the features of each image feature vector, and each image feature vector is recombined according to its weight parameters to obtain new v1 and v2 feature vectors, namely the first target vector and the second target vector.
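A hedged sketch of this fusion step, assuming the encoder module is a standard self-attention (Transformer) encoder layer; the layer sizes and sequence lengths are illustrative:

```python
# Sketch of step S102: v1 and v2 are vector-stitched, passed through a self-attention
# encoding layer that assigns a weight to every feature, and split back into the
# first and second target vectors. nn.TransformerEncoderLayer is an assumed stand-in
# for the preset model's encoder module.
import torch
import torch.nn as nn

d_model = 2048
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)

v1 = torch.randn(1, 9, d_model)          # first image feature vector (region tokens)
v2 = torch.randn(1, 49, d_model)         # second image feature vector (grid tokens)

stitched = torch.cat([v1, v2], dim=1)    # vector-stitch before the self-attention layer
encoded = encoder_layer(stitched)        # self-attention weights and recombines every feature

# Split back into the two target vectors, now re-weighted by global attention.
t1, t2 = encoded.split([v1.size(1), v2.size(1)], dim=1)
```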
And S103, splicing the target feature vectors to obtain vectors to be decoded.
In one embodiment, image feature vectors obtained after encoding objects to be described by different encoders are input into a preset model, and after the preset model outputs target vectors corresponding to the image feature vectors, vector stitching is performed on the target vectors to obtain vectors to be decoded.
In one embodiment, for example, a first target vector is obtained from the first image feature vector and a second target vector from the second image feature vector, and the first target vector and the second target vector are spliced to obtain the vector to be decoded.
S104, decoding the vector to be decoded based on a decoder module of the preset model to obtain a description text corresponding to the object to be described.
In one embodiment, the obtained vector to be decoded is input to a decoder module of a preset model, and a text description of the object to be described is generated through decoding.
In one embodiment, the preset model decodes the vector to be decoded and records the word predicted at each moment; after outputting the predicted word at the previous moment, it takes that word as an input parameter for predicting the word at the current moment. That is, the output at each moment becomes an input at the next moment.
In one embodiment, the predicted words at all times are spliced to obtain the description text of the object to be described.
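The splicing and autoregressive decoding of steps S103 and S104 could look like the following sketch; the decoder architecture, vocabulary size and special token ids are assumptions for illustration rather than the patent's trained model.

```python
# Illustrative sketch of S103-S104: splice the target vectors into the vector to be
# decoded, then greedily generate one predicted word per moment, feeding each output
# back as the input of the next moment.
import torch
import torch.nn as nn

d_model, vocab_size, bos_id, eos_id = 2048, 30000, 1, 2
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=3)
embed = nn.Embedding(vocab_size, d_model)
to_vocab = nn.Linear(d_model, vocab_size)

t1, t2 = torch.randn(1, 9, d_model), torch.randn(1, 49, d_model)
memory = torch.cat([t1, t2], dim=1)          # S103: spliced vector to be decoded

tokens = [bos_id]
for _ in range(30):                          # S104: greedy autoregressive decoding
    prev = embed(torch.tensor([tokens]))                  # predicted words so far
    out = decoder(prev, memory)                           # condition on the spliced vector
    next_id = to_vocab(out[:, -1]).argmax(-1).item()      # predicted word at this moment
    tokens.append(next_id)
    if next_id == eos_id:
        break
# mapping `tokens` back to words and joining them yields the description text
```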
The above embodiments provide a method, an apparatus, a computer device, and a storage medium for generating an image description text, which are based on at least two preset image encoders to encode an object to be described respectively, so as to obtain an image feature vector corresponding to the object to be described; the encoder module based on a preset model respectively processes the image feature vectors to obtain target feature vectors corresponding to the image feature vectors; splicing the target feature vectors to obtain vectors to be decoded; and decoding the vector to be decoded based on a decoder module of the preset model to obtain a description text corresponding to the object to be described. According to the method, a plurality of preset image encoders are used for encoding an object to be described to obtain local image features, the encoder modules of the preset models are used for encoding the obtained local image features to obtain global target feature vectors, and the decoder modules are used for decoding to obtain description texts of the images.
Referring to fig. 3, fig. 3 is a schematic flowchart of a method for generating an image description text according to an embodiment of the present application. The method for generating the image description text can be applied to a server and used for balancing the attention of global and local features in the image by combining the local features and the global features of the image, so that the accuracy of the image description text is improved.
As shown in fig. 3, step S104 of the image description text generation method specifically includes steps S201 to S203.
S201, decoding the vector to be decoded based on the decoder module to obtain a first predicted word output at a first moment;
S202, predicting an output word at a second moment based on the first predicted word to obtain a second predicted word;
S203, generating the description text based on the first predicted word and the second predicted word.
In one embodiment, the preset model decodes the vector to be decoded, records the predicted word predicted at the first time, and takes the predicted word at the first time as an input parameter for predicting the predicted word at the second time after outputting the predicted word at the first time. I.e. the output parameter at each moment will be the input parameter at the next moment until all predictions are completed. And after all the predicted words are predicted, splicing the recorded predicted words, and taking the obtained text as the description text of the current object to be described.
In one embodiment, the predicted words up to the previous time are denoted y_{1:t-1} and the predicted word at the current time is denoted y_t. The generation is modeled as

P(y_t | x, y_{1:t-1}),

where x is the object to be described, t is the current time, and P(y_t | x, y_{1:t-1}) is the probability of the predicted word y_t.

Since different image encoders are used, this can be expanded as

P(y_t | x, y_{1:t-1}) = Σ_j P(y_t | v_j, x, y_{1:t-1}) · P(v_j | x, y_{1:t-1}),

where v denotes the image feature vectors and v_j denotes the j-th image feature vector generated by the j-th image encoder.

Since v is obtained from the object to be described x, the formula can be approximated by dropping x, namely

P(y_t | y_{1:t-1}) ≈ Σ_j P(y_t | v_j, y_{1:t-1}) · P(v_j | y_{1:t-1}).

For each image feature vector v_j, the term P(v_j | y_{1:t-1}) can be regarded as the weight of that view feature, denoted β_j, and the prediction of the current word y_t from v_j is denoted f_θ(v_j); that is, P(v_j | y_{1:t-1}) ≈ β_j and P(y_t | v_j, y_{1:t-1}) ≈ f_θ(v_j), where θ denotes the parameters of f, learned by the model from the training data. This gives

P(y_t | y_{1:t-1}) ≈ Σ_j β_j · f_θ(v_j),

so generating and decoding the text word at the current time t only requires the different image features v and the predicted words up to the previous time t-1.
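A small numeric sketch of the mixture above: each image feature v_j contributes a word distribution f_θ(v_j), and the view weights β_j mix them into the prediction of the word at time t. The shapes, random values and softmax weighting are illustrative assumptions only.

```python
import torch

vocab_size, num_views = 30000, 2
f_theta = torch.softmax(torch.randn(num_views, vocab_size), dim=-1)  # f_theta(v_j) per view
beta = torch.softmax(torch.randn(num_views), dim=-1)                 # view weights beta_j

# P(y_t | y_{1:t-1}) ~= sum_j beta_j * f_theta(v_j)
p_yt = (beta.unsqueeze(-1) * f_theta).sum(dim=0)
y_t = p_yt.argmax().item()                   # predicted word at the current moment
```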
Referring to fig. 4, fig. 4 is a schematic flowchart of a method for generating an image description text according to an embodiment of the present application. The method for generating the image description text can be applied to a server and used for balancing the attention of global and local features in the image by combining the local features and the global features of the image, so that the accuracy of the image description text is improved.
As shown in fig. 4, before step S102 of the method for generating the image description text, the method specifically further includes steps S301 to S303.
S301, acquiring a sample image, and acquiring a description text corresponding to the sample image as a sample text;
s302, determining a positive sample set and a negative sample set based on the sample image and the sample text;
and S303, training a pre-training model based on the positive sample set and the negative sample set to obtain the preset model.
Determining a positive sample set and a negative sample set based on the sample image and the sample text includes: when the sample image input to the first image encoder is the same as the sample image input to the second image encoder and the input text is the correct description text of the input image, determining the input image and the input text as a positive sample image and a positive sample text, and generating the positive sample set; when the image input to the first image encoder is different from the image input to the second image encoder and the input text is not the correct description text of the input image, determining the input image and the input text as a negative sample image and a negative sample text, and generating the negative sample set.
Training a pre-training model based on the positive sample set and the negative sample set to obtain the preset model, including: processing sample images and sample texts in the positive sample set and the negative sample set respectively based on the first image encoder, the second image encoder and the text encoder to obtain positive sample similarity and negative sample similarity; based on a preset contrast loss function, the positive sample similarity and the negative sample similarity, respectively obtaining a first loss value between the first image encoder and the second image encoder, a second loss value between the first image encoder and the text encoder and a third loss value between the second image encoder and the text encoder; calculating a loss value of the pre-training model based on the first loss value, the second loss value and the third loss value to obtain a model loss value; and when the model loss value reaches a preset value, determining the pre-training model as the preset model.
In one embodiment, it is necessary to map the picture features and text sentence features to the same expression space when training the encoder.
In one embodiment, a sample image is acquired, along with the correct descriptive text corresponding to the sample image, as sample text.
In one embodiment, both positive sample training and negative sample training are required during model training. For a positive sample, the same sample image is input to both image encoders, and the description text is the correct description text corresponding to that sample image. For a negative sample, different sample images are input to the two image encoders, and the input text is not the description corresponding to the input sample images. For a positive sample the target label similarity is 1, and for a negative sample the target label similarity is 0.
In one embodiment, the positive sample set includes a positive sample image and descriptive text corresponding to the sample image, and the negative sample set includes a negative sample image and sample text, wherein the sample text in the negative sample set is not the correct descriptive text corresponding to the negative sample image.
In a specific embodiment, model training is performed with an InfoNCE loss function (a contrastive learning loss). For both the positive sample set and the negative sample set, three similarities are computed: the similarity between the first sample image feature obtained by the first image encoder from the sample image and the text feature obtained by the text encoder from the sample text; the similarity between the second sample image feature obtained by the second image encoder and that text feature; and the similarity between the first sample image feature and the second sample image feature. These computations yield the positive sample similarity and the negative sample similarity.
In one embodiment, the InfoNCE value between each pair of encoders is obtained by substituting the positive sample similarity and the negative sample similarity between the features produced by the encoders into the contrastive loss function (the InfoNCE function), and the model loss value is finally obtained by summing them. The encoders are trained with this model loss value, and when the model loss value reaches the expected value, the pre-training model at that point is determined to be the preset model.
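A hedged sketch of this contrastive pre-training: an InfoNCE-style loss is computed for each encoder pair (first image encoder vs. text encoder, second image encoder vs. text encoder, and the two image encoders against each other) over a batch in which matching positions are positives and every other pairing serves as a negative. The projection size and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """InfoNCE over a batch: row i of `a` matches row i of `b`; other rows are negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # pairwise similarities
    targets = torch.arange(a.size(0))         # positive samples sit on the diagonal
    return F.cross_entropy(logits, targets)

batch, dim = 8, 512
img_feat_1 = torch.randn(batch, dim)   # features from the first image encoder (projected)
img_feat_2 = torch.randn(batch, dim)   # features from the second image encoder (projected)
txt_feat   = torch.randn(batch, dim)   # features from the text encoder (projected)

loss = (info_nce(img_feat_1, img_feat_2)     # first loss value
        + info_nce(img_feat_1, txt_feat)     # second loss value
        + info_nce(img_feat_2, txt_feat))    # third loss value; summed into the model loss
```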
Referring to fig. 5, fig. 5 is a schematic block diagram of an apparatus for generating an image description text for performing the foregoing method for generating an image description text according to an embodiment of the present application. The image description text generating device may be configured in a server.
As shown in fig. 5, the image description text generating apparatus 400 includes:
the image feature vector obtaining module 401 is configured to encode an object to be described based on at least two preset image encoders, respectively, to obtain an image feature vector corresponding to the object to be described;
the target feature vector obtaining module 402 is configured to process the image feature vectors respectively based on an encoder module of a preset model, and obtain target feature vectors corresponding to the image feature vectors;
a to-be-decoded vector obtaining module 403, configured to splice the target feature vectors to obtain to-be-decoded vectors;
the description text obtaining module 404 is configured to decode the vector to be decoded based on the decoder module of the preset model, and obtain a description text corresponding to the object to be described.
In one embodiment, the image feature vector obtaining module 401 includes:
and the image coding unit is used for respectively coding the object to be described based on the first image coder and the second image coder to obtain a first image feature vector and a second image feature vector.
In one embodiment, the target feature vector obtaining module 402 includes:
the weight parameter obtaining unit is used for respectively carrying out self-attention weight calculation on the first image feature vector and the second image feature vector based on the self-attention layer of the coding module to obtain a first weight parameter of the image feature in the first image feature vector and a second weight parameter of the image feature in the second image feature vector;
and the target vector obtaining unit is used for respectively processing the first image feature vector and the second image feature vector based on the first weight parameter and the second weight parameter to obtain the first target vector and the second target vector.
In one embodiment, the descriptive text obtaining module 404 includes:
the first predicted word obtaining unit is used for decoding the vector to be decoded based on the decoder module to obtain a first predicted word output at a first moment;
a second predicted word obtaining unit, configured to predict an output word at a second moment based on the first predicted word, to obtain a second predicted word;
and the descriptive text generation unit is used for generating the descriptive text based on the first predicted word and the second predicted word.
In one embodiment, the apparatus 400 for generating image description text further includes: a model training module, the model training module comprising:
the sample acquisition unit is used for acquiring a sample image and acquiring a description text corresponding to the sample image as a sample text;
a sample set determining unit configured to determine a positive sample set and a negative sample set based on the sample image and the sample text;
the preset model obtaining unit is used for training a pre-training model based on the positive sample set and the negative sample set to obtain the preset model.
In one embodiment, the sample set determining unit includes:
a positive sample set generation subunit configured to determine the input image and the input text as a positive sample image and a positive sample text when the input sample images of the first image encoder and the second image encoder are the same and the input text is a correct description text of the input image, and generate the positive sample set;
a negative sample set generation subunit configured to determine the input image and the input text as a negative sample image and a negative sample text when the input image of the first image encoder is different from the input image of the second image encoder and the input text is not a correct description text of the input image, and generate the negative sample set.
In one embodiment, the preset model obtaining unit includes:
the similarity calculation subunit is used for respectively processing the sample images and the sample texts in the positive sample set and the negative sample set based on the first image encoder, the second image encoder and the text encoder to obtain positive sample similarity and negative sample similarity;
a loss value calculating subunit, configured to obtain a first loss value between the first image encoder and the second image encoder, a second loss value between the first image encoder and the text encoder, and a third loss value between the second image encoder and the text encoder, respectively, based on a preset contrast loss function, the positive sample similarity, and the negative sample similarity;
a model loss value obtaining subunit, configured to calculate a loss value of the pre-training model based on the first loss value, the second loss value, and the third loss value, to obtain a model loss value;
and the preset model determining subunit is used for determining the pre-training model as the preset model when the model loss value reaches a preset value.
It should be noted that, for convenience and brevity of description, the specific working process of the apparatus and each module described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server.
With reference to FIG. 6, the computer device includes a processor, memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause a processor to perform any of the methods for generating image description text described herein.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for executing the computer program in the non-volatile storage medium; when executed by the processor, the computer program causes the processor to perform any of the methods for generating image description text described herein.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
encoding an object to be described based on at least two preset image encoders respectively to obtain image feature vectors corresponding to the object to be described;
the encoder module based on a preset model respectively processes the image feature vectors to obtain target feature vectors corresponding to the image feature vectors;
splicing the target feature vectors to obtain vectors to be decoded;
and decoding the vector to be decoded based on a decoder module of the preset model to obtain a description text corresponding to the object to be described.
In one embodiment, the processor is configured to, when implementing at least two image encoders based on a preset, encode an object to be described respectively to obtain an image feature vector corresponding to the object to be described, implement:
and respectively encoding the object to be described based on the first image encoder and the second image encoder to obtain a first image feature vector and a second image feature vector.
In one embodiment, when implementing the encoder module based on the preset model, the processor is configured to process the image feature vectors respectively to obtain target feature vectors corresponding to the image feature vectors, so as to implement:
based on the self-attention layer of the coding module, respectively carrying out self-attention weight calculation on the first image feature vector and the second image feature vector to obtain a first weight parameter of the image feature in the first image feature vector and a second weight parameter of the image feature in the second image feature vector;
and processing the first image feature vector and the second image feature vector based on the first weight parameter and the second weight parameter respectively to obtain the first target vector and the second target vector.
In one embodiment, the processor is configured to, when implementing a decoder module based on the preset model, decode the vector to be decoded to obtain a description text corresponding to the object to be described, implement:
decoding the vector to be decoded based on the decoder module to obtain a first predicted word output at a first moment;
predicting the output word at the second moment based on the first predicted word to obtain a second predicted word;
and generating the descriptive text based on the first predicted word and the second predicted word.
In one embodiment, before implementing the encoder module based on the preset model, the processor is further configured to, before processing the image feature vectors to obtain target feature vectors corresponding to the image feature vectors, implement:
acquiring a sample image, and acquiring a description text corresponding to the sample image as a sample text;
determining a positive sample set and a negative sample set based on the sample image and the sample text;
and training a pre-training model based on the positive sample set and the negative sample set to obtain the preset model.
In one embodiment, the processor, when implementing determining a positive sample set and a negative sample set based on the sample image and the sample text, is configured to implement:
when the sample image input to the first image encoder is the same as the sample image input to the second image encoder and the input text is the correct description text of the input image, determining the input image and the input text as a positive sample image and a positive sample text, and generating the positive sample set;
when the image input to the first image encoder is different from the image input to the second image encoder and the input text is not the correct description text of the input image, determining the input image and the input text as a negative sample image and a negative sample text, and generating the negative sample set.
In one embodiment, the processor is configured to, when implementing training a pre-training model based on the positive sample set and the negative sample set to obtain the preset model, implement:
processing sample images and sample texts in the positive sample set and the negative sample set respectively based on the first image encoder, the second image encoder and the text encoder to obtain positive sample similarity and negative sample similarity;
based on a preset contrast loss function, the positive sample similarity and the negative sample similarity, respectively obtaining a first loss value between the first image encoder and the second image encoder, a second loss value between the first image encoder and the text encoder and a third loss value between the second image encoder and the text encoder;
calculating a loss value of the pre-training model based on the first loss value, the second loss value and the third loss value to obtain a model loss value;
and when the model loss value reaches a preset value, determining the pre-training model as the preset model.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program comprises program instructions, and the processor executes the program instructions to realize the method for generating any image description text provided by the embodiment of the application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating an image description text, comprising:
encoding an object to be described based on at least two preset image encoders respectively to obtain image feature vectors corresponding to the object to be described;
the encoder module based on a preset model respectively processes the image feature vectors to obtain target feature vectors corresponding to the image feature vectors;
splicing the target feature vectors to obtain vectors to be decoded;
and decoding the vector to be decoded based on a decoder module of the preset model to obtain a description text corresponding to the object to be described.
2. The method for generating an image description text according to claim 1, wherein the image encoder includes a first image encoder and a second image encoder, the encoding is performed on objects to be described based on at least two preset image encoders, respectively, to obtain image feature vectors corresponding to the objects to be described, and the method includes:
and respectively encoding the object to be described based on the first image encoder and the second image encoder to obtain a first image feature vector and a second image feature vector.
3. The method for generating an image description text according to claim 2, wherein the encoder module based on a preset model processes the image feature vectors respectively to obtain target feature vectors corresponding to the image feature vectors, and includes:
based on the self-attention layer of the coding module, respectively carrying out self-attention weight calculation on the first image feature vector and the second image feature vector to obtain a first weight parameter of the image feature in the first image feature vector and a second weight parameter of the image feature in the second image feature vector;
and processing the first image feature vector and the second image feature vector based on the first weight parameter and the second weight parameter respectively to obtain the first target vector and the second target vector.
4. The method for generating image description text according to any one of claims 1 to 3, wherein the decoder module based on the preset model decodes the vector to be decoded to obtain the description text corresponding to the object to be described, and includes:
decoding the vector to be decoded based on the decoder module to obtain a first predicted word output at a first moment;
predicting the output word at the second moment based on the first predicted word to obtain a second predicted word;
and generating the descriptive text based on the first predicted word and the second predicted word.
5. The method for generating an image description text according to claim 1, wherein the encoder module based on a preset model processes the image feature vectors respectively, and before obtaining the target feature vector corresponding to the image feature vector, the method further comprises:
acquiring a sample image, and acquiring a description text corresponding to the sample image as a sample text;
determining a positive sample set and a negative sample set based on the sample image and the sample text;
and training a pre-training model based on the positive sample set and the negative sample set to obtain the preset model.
6. The method of generating image description text according to claim 5, wherein the determining a positive sample set and a negative sample set based on the sample image and the sample text includes:
when the sample image input to the first image encoder is the same as the sample image input to the second image encoder and the input text is the correct description text of the input image, determining the input image and the input text as a positive sample image and a positive sample text, and generating the positive sample set;
when the image input to the first image encoder is different from the image input to the second image encoder and the input text is not the correct description text of the input image, determining the input image and the input text as a negative sample image and a negative sample text, and generating the negative sample set.
7. The method for generating image description text according to claim 5, wherein training a pre-training model based on the positive sample set and the negative sample set to obtain the preset model includes:
processing sample images and sample texts in the positive sample set and the negative sample set respectively based on the first image encoder, the second image encoder and the text encoder to obtain positive sample similarity and negative sample similarity;
based on a preset contrast loss function, the positive sample similarity and the negative sample similarity, respectively obtaining a first loss value between the first image encoder and the second image encoder, a second loss value between the first image encoder and the text encoder and a third loss value between the second image encoder and the text encoder;
calculating a loss value of the pre-training model based on the first loss value, the second loss value and the third loss value to obtain a model loss value;
and when the model loss value reaches a preset value, determining the pre-training model as the preset model.
8. An image description text generation apparatus, comprising:
the image feature vector obtaining module is used for respectively encoding the object to be described based on at least two preset image encoders to obtain an image feature vector corresponding to the object to be described;
the target feature vector obtaining module is used for respectively processing the image feature vectors based on an encoder module of a preset model to obtain target feature vectors corresponding to the image feature vectors;
the vector to be decoded obtaining module is used for splicing the target feature vectors to obtain a vector to be decoded;
the description text obtaining module is used for decoding the vector to be decoded based on the decoder module of the preset model to obtain the description text corresponding to the object to be described.
9. A computer device, the computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and implement the method for generating an image description text according to any one of claims 1 to 7 when the computer program is executed.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the image description text generation method according to any one of claims 1 to 7.
CN202311487597.2A 2023-11-08 2023-11-08 Image description text generation method and device, computer equipment and storage medium Pending CN117576689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311487597.2A CN117576689A (en) 2023-11-08 2023-11-08 Image description text generation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311487597.2A CN117576689A (en) 2023-11-08 2023-11-08 Image description text generation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117576689A true CN117576689A (en) 2024-02-20

Family

ID=89892761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311487597.2A Pending CN117576689A (en) 2023-11-08 2023-11-08 Image description text generation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117576689A (en)

Similar Documents

Publication Publication Date Title
CN111062489B (en) Multi-language model compression method and device based on knowledge distillation
US11049239B2 (en) Deep neural network based identification of realistic synthetic images generated using a generative adversarial network
CN110444203B (en) Voice recognition method and device and electronic equipment
JP2022177220A (en) Method for training text recognition model, method for recognizing text, and device for recognizing text
CN114549935B (en) Information generation method and device
CN113159091B (en) Data processing method, device, electronic equipment and storage medium
WO2020103674A1 (en) Method and device for generating natural language description information
US20240078385A1 (en) Method and apparatus for generating text
CN113378784A (en) Training method of video label recommendation model and method for determining video label
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN115563335A (en) Model training method, image-text data processing device, image-text data processing equipment and image-text data processing medium
CN115880317A (en) Medical image segmentation method based on multi-branch feature fusion refining
CN116128894A (en) Image segmentation method and device and electronic equipment
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN111898338B (en) Text generation method and device and electronic equipment
CN111475635B (en) Semantic completion method and device and electronic equipment
CN113435499A (en) Label classification method and device, electronic equipment and storage medium
CN117576689A (en) Image description text generation method and device, computer equipment and storage medium
CN115249361A (en) Instructional text positioning model training, apparatus, device, and medium
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
CN111339367B (en) Video processing method and device, electronic equipment and computer readable storage medium
US10910014B2 (en) Method and apparatus for generating video
CN113392249A (en) Image-text information classification method, image-text classification model training method, medium, and apparatus
CN114139703A (en) Knowledge distillation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination