CN117423108B - Image fine-grained description method and system for an instruction fine-tuning multi-modal large model - Google Patents

Image fine-grained description method and system for an instruction fine-tuning multi-modal large model

Info

Publication number
CN117423108B
Authority
CN
China
Prior art keywords
text
image
fine
instruction
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311273241.9A
Other languages
Chinese (zh)
Other versions
CN117423108A (en)
Inventor
朱贵波
李宗树
吴凌翔
易东
刘智威
葛国敬
王金桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Artificial Intelligence Research Institute
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Wuhan Artificial Intelligence Research Institute
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Artificial Intelligence Research Institute and Institute of Automation, Chinese Academy of Sciences
Priority to CN202311273241.9A
Publication of CN117423108A
Application granted
Publication of CN117423108B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an image fine-grained description method and system for an instruction fine-tuning multi-modal large model, relating to the field of computer technology. The method comprises the following steps: acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence is obtained from high-level semantic information extracted from the target image; and acquiring a fine-grained descriptive text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used to provide the instruction information required for describing the target image at fine granularity. Based on the instruction information required for fine-grained image description and a vector sequence carrying the image's high-level semantic information, the invention can accurately identify and describe the attributes and features of the important objects in an image, realizes fine-grained description of images, and provides richer and more specific image information.

Description

Image fine-grained description method and system for an instruction fine-tuning multi-modal large model
Technical Field
The invention relates to the field of computer technology, and in particular to an image fine-grained description method and system for an instruction fine-tuning multi-modal large model.
Background
In the field of artificial intelligence, multi-modal large models and instruction fine-tuning are two major research directions: research on multi-modal large models has grown explosively over the past two years, and instruction fine-tuning has proven effective in practical downstream tasks in natural language processing.
The goal of a multi-modal large model is to enable one model to understand and process different types of data, such as text, images and sound. In the pre-training stage, the multi-modal large model undergoes fully supervised or self-supervised training on large-scale multi-modal datasets, so as to learn the underlying patterns of the multi-modal data and preliminarily align the semantic information of the different modalities. However, the multi-modal data of the pre-training stage often contain considerable noise, which reduces the robustness of the multi-modal large model in downstream tasks. This typically manifests as incomplete understanding of multi-modal information in multi-modal understanding tasks, leading to erroneous classification and prediction, or as single-modality outputs of multi-modal generation tasks that are entirely inconsistent with human logic and perception. Fine-tuning the multi-modal large model is therefore necessary to adapt it to downstream tasks.
Instruction fine-tuning is a new method of fine-tuning large models that lets a large model adapt to downstream tasks by learning a simple set of instructions, without a complex fine-tuning procedure. Its main advantage is that it reduces training time and enables large models to adapt more quickly to new downstream tasks. Instruction fine-tuning works by describing the task as part of the large model's input. This allows the large model to learn new tasks without additional labeled data, thereby significantly reducing the amount of data required for training.
While multi-modal large models and instruction fine-tuning have made significant advances in the field of artificial intelligence, they still face many challenges. For multi-modal large models, how to effectively fuse information from different modalities and how to design effective pre-training strategies remain open problems. For instruction fine-tuning, how to design effective instructions and how to handle the uncertainty of large models likewise require further investigation.
On the other hand, within the multi-modal field, the image description task has become a research hotspot. In particular, fine-grained image captioning has attracted wide attention for its deep understanding and accurate description of image content: it must not only identify the important objects in an image but also describe their specific attribute features, such as shape, color and texture. This style of image description provides richer and more specific image information and is of great value for many practical application scenarios. However, the fine-grained image description task also faces challenges, such as how to accurately identify and describe the attributes and features of important objects, and how to handle the complexity and diversity of the information in images.
Disclosure of Invention
The invention provides an image fine-grained description method and system for an instruction fine-tuning multi-modal large model, which are used to solve the prior-art problem of accurately identifying and describing the attributes and features of the important objects in an image, thereby realizing fine-grained description of images.
The invention provides an image fine-grained description method for an instruction fine-tuning multi-modal large model, comprising the following steps:
acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained from high-level semantic information extracted from the target image;
and acquiring a fine-grained descriptive text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used to provide the instruction information required for describing the target image at fine granularity.
The invention also provides an image fine-grained description system for an instruction fine-tuning multi-modal large model, comprising:
a data acquisition module, configured to acquire a first vector sequence corresponding to a target image, wherein the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained from high-level semantic information extracted from the target image;
and an image description module, configured to acquire a fine-grained descriptive text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used to provide the instruction information required for describing the target image at fine granularity.
The invention also provides an electronic device, comprising a processor and a memory storing a computer program, wherein the processor, when executing the program, implements the image fine-grained description method for an instruction fine-tuning multi-modal large model as described above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image fine-grained description method for an instruction fine-tuning multi-modal large model as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the image fine-grained description method for an instruction fine-tuning multi-modal large model as described in any one of the above.
With the image fine-grained description method and system for an instruction fine-tuning multi-modal large model provided by the invention, the attributes and features of the important objects in an image can be accurately identified and described based on the instruction information required for describing the image at fine granularity and on a vector sequence carrying the image's high-level semantic information, thereby realizing fine-grained description of images and providing richer and more specific image information.
Drawings
In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is one schematic flowchart of the image fine-grained description method for an instruction fine-tuning multi-modal large model provided by the invention;
FIG. 2 is a schematic diagram of the model structure of the visual encoder in the multi-modal large model provided by the invention;
FIG. 3 is a schematic diagram of the model structure of the text decoder in the multi-modal large model provided by the invention;
FIG. 4 is a schematic diagram of the next-token prediction task provided by the invention;
FIG. 5 is a schematic diagram of the self-attention mask matrix for causal language modeling provided by the invention;
FIG. 6 is a schematic diagram of the forward propagation of the text decoder in the inference/test stage provided by the invention;
FIG. 7 is a schematic flowchart of instruction fine-tuning of the multi-modal large model provided by the invention;
FIG. 8 is a schematic flowchart of constructing fine-grained descriptive text for images provided by the invention;
FIG. 9 is a second schematic flowchart of the image fine-grained description method for an instruction fine-tuning multi-modal large model provided by the invention;
FIG. 10 is a schematic structural diagram of the image fine-grained description system for an instruction fine-tuning multi-modal large model provided by the invention;
FIG. 11 is a schematic diagram of the physical structure of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
The image fine-grained description method for an instruction fine-tuning multi-modal large model provided by the invention applies the instruction fine-tuning technique to the downstream image description task of a multi-modal large model. It successfully realizes instruction fine-tuning of the multi-modal large model with a very small amount of multi-modal instruction data, and thereby realizes fine-grained image description. The method is implemented as follows.
FIG. 1 is one schematic flowchart of the image fine-grained description method for an instruction fine-tuning multi-modal large model provided by the invention. As shown in FIG. 1, the method comprises:
Step 110: acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence is obtained from high-level semantic information extracted from the target image;
Step 120: acquiring a fine-grained descriptive text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used to provide the instruction information required for describing the target image at fine granularity.
It should be noted that, the execution subject of the above method may be a computer device.
Optionally, the target image is the image to be described at fine granularity, and may more specifically be a three-channel RGB image. The first vector sequence corresponding to the target image is obtained by encoding the second vector sequence corresponding to the target image, and the two sequences have the same dimension. The first vector sequence may specifically consist of a plurality of one-dimensional vectors of fixed dimension.
The second vector sequence corresponding to the target image is a numerical carrier of the target image's high-level semantic information. It can be obtained by extracting the high-level semantic information of the target image, more specifically by inputting the target image into a visual encoder such as a Vision Transformer model. The high-level semantic information characterizes the attributes and features of the important objects in the target image and can be widely applied to various downstream tasks, such as object detection, instance segmentation and image depth estimation.
A prompt template, which may also be called an instruction template, is any text prompt template (or piece of instruction information) randomly drawn from a set of text prompt templates; the first prompt template is the one carrying the instruction information for describing the target image at fine granularity.
Optionally, the target image is described at fine granularity using the instruction information in the first prompt template together with the acquired first vector sequence of the target image, generating the fine-grained descriptive text of the target image, also called the fine-grained image description text.
With the image fine-grained description method for an instruction fine-tuning multi-modal large model provided by the invention, the attributes and features of the important objects in an image can be accurately identified and described based on the instruction information required for describing the image at fine granularity and on a vector sequence carrying the image's high-level semantic information, thereby realizing fine-grained description of images and providing richer and more specific image information.
Further, in one embodiment, acquiring the fine-grained descriptive text of the target image according to the first vector sequence and the first prompt template may comprise:
inputting the first vector sequence and the first prompt template into a target text decoder to obtain the fine-grained descriptive text, wherein the target text decoder is used to generate the fine-grained descriptive text of the target image according to the first prompt template; the first vector sequence corresponding to the target image is obtained by inputting the second vector sequence corresponding to the target image into a target multi-modal information adaptation module of the multi-modal large model, and the target multi-modal information adaptation module is used to encode the second vector sequence into the first vector sequence.
The target text decoder and the target multi-modal information adaptation module are obtained as follows:
splicing the first vector sequence corresponding to each image in an image dataset with a second prompt template to obtain a first text sequence, wherein the second prompt template is used to provide the instruction information required for describing the images in the image dataset at fine granularity, and the text decoder is used to generate the fine-grained descriptive text of each image according to the second prompt template;
splicing the fine-grained descriptive text corresponding to each image in the image dataset onto the end of the first text sequence to obtain a second text sequence;
inputting the second text sequence into the text decoder of the multi-modal large model, and fine-tuning the multi-modal information adaptation module and the text decoder of the multi-modal large model until a target loss function converges;
taking the converged multi-modal information adaptation module as the target multi-modal information adaptation module;
and taking the converged text decoder as the target text decoder.
In exploring the path toward Artificial General Intelligence (AGI), researchers have constructed multi-modal large models to perform cross-modal understanding and generation tasks. A multi-modal large model can process and understand data of multiple modalities, such as text, images and audio. However, the true power of such a model lies not only in its multi-modal capability, but also in its extremely low fine-tuning cost and in the strong low-sample understanding and generation abilities it exhibits in downstream tasks.
Fine-tuning is a method of adapting a model to a particular downstream task. Through fine-tuning, the model is made to focus on understanding the details of a given downstream task. The potential of the fine-tuning approach is large: with it, downstream tasks can be understood more deeply, and modal semantic information can be understood and generated more accurately, at extremely low data and hardware cost.
However, research on fine-tuning methods for multi-modal large models is still at an early stage, and more techniques are needed to explore and optimize the rapid fine-tuning of multi-modal large models: how to fine-tune them more efficiently, how to understand the semantic information of each modality more accurately, and how to apply the method to more fields.
Although multi-modal large models perform well on some downstream tasks, fine-grained image description remains challenging, because it requires a model that can understand image content in depth: identifying tiny details in the image, understanding their meaning, and translating that understanding into an accurate, detailed text description. The invention therefore focuses on a fine-grained image description method based on instruction fine-tuning of a multi-modal large model, which can both advance the development of multi-modal large models and provide new ideas and methods for addressing this challenge.
The core idea of the embodiments of the invention is to organically combine a multi-modal large model with the instruction fine-tuning technique for the downstream task of fine-grained image description.
The method can be used in the field of fine-grained image description and can also positively advance other research areas of multi-modal large models. First, the invention offers a new perspective for understanding and describing multi-modal data, allowing the internal connections of multi-modal data to be understood more deeply and their content to be described more accurately; this accuracy of understanding and depth of description can provide new ideas and methods for researchers in other fields. Second, it advances research on downstream-task fine-tuning strategies for multi-modal large models: the construction of the instruction fine-tuning templates and the instruction dataset in the invention can inform research in other areas of multi-modal large models. Finally, it promotes the understanding and application of multi-modal large models: through the invention, the working principles of multi-modal large models can be understood in depth, optimal application strategies for downstream tasks can be explored, and application effects can be optimized.
Optionally, the first vector sequence corresponding to the target image and its first prompt template are input into the target text decoder of the multi-modal large model, and the fine-grained descriptive text of the target image is generated by the target text decoder according to the first prompt template.
The first vector sequence corresponding to the target image is obtained by inputting the second vector sequence corresponding to the target image into the target multi-modal information adaptation module of the multi-modal large model, which encodes the second vector sequence.
Further, in one embodiment, the second vector sequence corresponding to the target image may be acquired as follows:
inputting the target image into the visual encoder of the multi-modal large model to acquire the second vector sequence corresponding to the target image, wherein the visual encoder is used to extract the high-level semantic information of the target image and produce the second vector sequence.
Optionally, the second vector sequence of the target image is obtained by inputting the target image into the visual encoder of the multi-modal large model, which encodes the target image into an image embedding vector sequence, i.e. the second vector sequence. The visual encoder extracts the high-level semantic information of the target image to obtain its second vector sequence.
Optionally, the target text decoder and the target multi-modal information adaptation module of the multi-modal large model are obtained as follows:
constructing the overall architecture of the multi-modal large model, which consists mainly of three parts: a visual encoder, a multi-modal information adaptation module and a text decoder;
taking the images in the image dataset, the fine-grained descriptive text corresponding to each image, and the prompt template corresponding to each image (i.e. the second prompt template) as the input of the multi-modal large model, and performing instruction fine-tuning on it, wherein the images in the image dataset are all three-channel RGB images;
inputting each image in the image dataset into the visual encoder, which extracts the high-level semantic information of the image to obtain the second vector sequence corresponding to each image;
and inputting the second vector sequence corresponding to each image into the multi-modal information adaptation module, which encodes it and outputs the first vector sequence corresponding to each image.
FIG. 2 is a schematic diagram of the model structure of the visual encoder in the multi-modal large model provided by the invention. Referring to FIG. 2, the visual encoder adopts a Vision Transformer structure and takes three-channel RGB images as input. The size of the input image is fixed (e.g. 224×224); the input image is sliced evenly into image blocks of the same, manually specified size (e.g. 14×14 or 16×16), and a fully-connected layer then maps every image block to an image-block feature vector. The image-block feature vectors undergo self-attention operations in the Vision Transformer (ViT), which outputs the image embedding vector sequence, i.e. the second vector sequence, composed of a plurality of one-dimensional vectors of fixed dimension. The visual encoder extracts the high-level semantic information of the image, and a pre-trained visual encoder can be widely applied to various downstream tasks, such as object detection, instance segmentation and image depth estimation.
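As a concrete illustration of the patch-embedding step just described, the following minimal PyTorch sketch maps an RGB image batch to the sequence of image-block feature vectors consumed by the ViT self-attention blocks. The values used (224×224 input, 16×16 patches, 768-dimensional embeddings) are illustrative assumptions, not fixed by the patent:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Slice the image into equal, non-overlapping blocks and map each block to a vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution whose stride equals its kernel size is equivalent to
        # cutting the image into patches and applying one shared
        # fully-connected layer to each flattened patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, dim) patch feature vectors

imgs = torch.randn(2, 3, 224, 224)             # a batch of three-channel RGB images
patch_vectors = PatchEmbed()(imgs)             # fed into the ViT self-attention blocks
print(patch_vectors.shape)                     # torch.Size([2, 196, 768])
```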
The dimension of the first vector sequence output by the multi-modal information adaptation module is the same as the input dimension of the text decoder. The multi-modal information adaptation module serves two functions: (1) changing the dimension of the second vector sequence output by the visual encoder so that it fits the input of the text decoder; and (2) bridging the modality gap that exists between the high-level semantic information of different modalities: by further encoding the second vector sequence of the image, the visual modality information extracted by the visual encoder is adapted to the text decoder, the gap between the visual and text modalities is eliminated, and the visual and textual semantic information is aligned. The multi-modal information adaptation module can generally adopt a single fully-connected layer or a multi-layer perceptron with 2 or 3 layers.
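A minimal sketch of such an adaptation module, assuming the 2-layer MLP variant with illustrative dimensions (a 1024-dimensional visual feature projected into a 4096-dimensional decoder embedding space):

```python
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, txt_dim),   # function (1): re-dimension to the decoder input
            nn.GELU(),
            nn.Linear(txt_dim, txt_dim),   # function (2): further encode to bridge the modality gap
        )

    def forward(self, second_vector_sequence):
        # (B, num_patches, vis_dim) -> (B, num_patches, txt_dim)
        return self.net(second_vector_sequence)
```

A single `nn.Linear(vis_dim, txt_dim)` would realize the single-fully-connected-layer variant mentioned above.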
The first vector sequence corresponding to each image is spliced with the second prompt template to obtain the first text sequence, and the fine-grained descriptive text corresponding to each image is spliced onto the end of the first text sequence to obtain the second text sequence. The second prompt template may specifically be any text prompt template (or piece of instruction information) randomly drawn from the set of text prompt templates, and is used to provide the instruction information required for describing the images in the image dataset at fine granularity.
The second text sequence is input into the text decoder, and the multi-modal information adaptation module and the text decoder of the multi-modal large model are fine-tuned until the target loss function converges;
the converged multi-modal information adaptation module is taken as the target multi-modal information adaptation module, and the converged text decoder is taken as the target text decoder.
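At the embedding level, the splicing described above can be pictured as the following sketch; the tensor names and shapes are illustrative assumptions, not the patent's reference code:

```python
import torch

def build_second_text_sequence(first_vector_seq, prompt_embeds, caption_embeds):
    """first_vector_seq: (M, D) output of the adaptation module;
    prompt_embeds:    (N, D) embedded second prompt template;
    caption_embeds:   (Q, D) embedded fine-grained descriptive text."""
    first_text_sequence = torch.cat([first_vector_seq, prompt_embeds], dim=0)
    # Append the description to the end of the first text sequence.
    return torch.cat([first_text_sequence, caption_embeds], dim=0)  # (M+N+Q, D)
```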
Optionally, FIG. 3 is a schematic diagram of the model structure of the text decoder in the multi-modal large model provided by the invention. Referring to FIG. 3, the text decoder of the invention may use a Transformer as its structure (the visual encoder uses a Vision Transformer structure to ensure structural consistency between the two). In the instruction fine-tuning stage, the input of the text decoder consists of three parts: the first vector sequence output by the multi-modal information adaptation module, the second prompt template, and the fine-grained descriptive text corresponding to the image in the image dataset. In the inference/test stage, the input of the text decoder consists of two parts: the prompt template and the image embedding vector sequence output by the multi-modal information adaptation module.
The prompt template is used to prompt the text decoder to perform the fine-grained image description task, or in other words to instruct the text decoder to generate the fine-grained descriptive text of an image. For example, a prompt template may be defined as: "Generate the corresponding fine-grained descriptive text according to the image <image>.", where <image> in the prompt template is replaced by the image embedding vector sequence (i.e. the first vector sequence) output by the multi-modal information adaptation module.
Optionally, FIG. 4 is a schematic diagram of the next-token prediction task provided by the invention. Referring to FIG. 4, in the instruction fine-tuning stage the text decoder performs the next-token prediction task, also called an autoregressive task: each token predicts only the next token adjacent to it, and predicts neither the tokens before it nor the tokens further after it. The loss function corresponding to the next-token prediction task (i.e. the target loss function) is the cross-entropy loss. In the training stage, the next-token prediction task of the text decoder takes the fine-grained descriptive text of the images in the image dataset as the prediction target, and this text is also the ground truth of the text decoder. That is, the cross-entropy loss value comprises only the loss of predicting the fine-grained text description of the image; it does not include a cross-entropy loss for the second prompt template or for the image embedding vector sequence corresponding to the images in the image dataset.
Optionally, FIG. 5 is a schematic diagram of the self-attention mask matrix for causal language modeling provided by the invention. Referring to FIG. 5, to accommodate the next-token prediction task, the self-attention operation of the text decoder adopts a causal language modeling mechanism.
The principle of causal language modeling is as follows: a token in the input sequence of the text decoder performs self-attention only over itself and the tokens before it, and not over the tokens after it. The causal language model is implemented using a self-attention mask matrix. In conventional self-attention, the query sequence and the key sequence undergo a dot-product operation to obtain a logits matrix; the self-attention mask matrix is then added element-wise to the logits matrix. The self-attention mask matrix is a lower-triangular matrix whose diagonal and lower-triangular element values are all 0 and whose upper-triangular element values are all negative infinity. The element-wise sum of the self-attention mask matrix and the logits matrix is called the causal logits matrix, which then undergoes the usual self-attention softmax operation to obtain the causal self-attention coefficient matrix. Because the upper-triangular elements of the causal logits matrix are all negative infinity, the upper-triangular elements of the causal self-attention coefficient matrix after the softmax operation are all 0. In this way each token attends only to itself and the tokens before it, preventing the information leakage and model overfitting that would arise in ordinary self-attention.
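The following minimal sketch (single attention head, no learned projections; purely illustrative) shows the mask construction and its effect:

```python
import torch

def causal_self_attention(q, k, v):
    L, d = q.shape
    logits = q @ k.T / d ** 0.5                        # (L, L) logits matrix
    # Diagonal and lower triangle are 0, upper triangle is -inf.
    mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    causal_logits = logits + mask                      # element-wise addition
    attn = torch.softmax(causal_logits, dim=-1)        # upper triangle becomes 0
    return attn @ v

q = k = v = torch.randn(5, 8)
out = causal_self_attention(q, k, v)                   # each position attends only to itself and its past
```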
Optionally, FIG. 6 is a schematic diagram of the forward propagation of the text decoder in the inference/test stage provided by the invention. Referring to FIG. 6, in the inference/test stage the text decoder generates only one token per forward propagation, until the special token marking the end of the sentence is generated. As in the instruction fine-tuning stage, the text decoder generates only the next adjacent token. Assuming the text decoder generates n tokens in total in the inference/validation stage (including the special end-of-sentence token), it needs to forward-propagate n times in total; each forward propagation appends the token generated by the previous propagation to the end of the decoder's input text sequence, which serves as the new input sequence for predicting the next token.
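A greedy-decoding sketch of this loop is given below; `decoder` and `eos_id` are placeholders for the fine-tuned text decoder and its end-of-sentence token, not an actual API defined by the patent:

```python
def generate(decoder, input_ids, eos_id, max_new_tokens=128):
    """`decoder` is assumed to map a token-id list to per-position vocabulary
    logits; `eos_id` is the special end-of-sentence token id."""
    for _ in range(max_new_tokens):
        logits = decoder(input_ids)              # one forward propagation
        next_id = int(logits[-1].argmax())       # only the next token is produced
        input_ids = input_ids + [next_id]        # appended as input for the next pass
        if next_id == eos_id:                    # stop at the end-of-sentence token
            break
    return input_ids
```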
Optionally, in the fine-tuning stage the visual encoder and the text decoder load the model parameter values of their pre-trained counterparts. For example, if a Vision Transformer is used as the visual encoder, the pre-trained parameter values of CLIP-ViT-large-patch14 can be loaded directly; if LLaMA is used as the text decoder, the pre-trained parameter values of Vicuna-7B or Vicuna-13B can be loaded directly. The parameters of the multi-modal information adaptation module are randomly initialized. In the instruction fine-tuning stage, the weights of the visual encoder are frozen, while the weights of the multi-modal information adaptation module and of the text decoder are fine-tuned.
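In PyTorch terms, this parameter policy might look like the following sketch; the module objects are placeholders and the learning rate is an assumed value:

```python
import torch
import torch.nn as nn

# Placeholder modules stand in for the pre-trained encoder, the randomly
# initialized adaptation module, and the pre-trained decoder.
visual_encoder, adapter, text_decoder = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

for p in visual_encoder.parameters():
    p.requires_grad = False     # frozen throughout instruction fine-tuning

# Only the adaptation module and the text decoder are updated.
trainable = list(adapter.parameters()) + list(text_decoder.parameters())
optimizer = torch.optim.AdamW(trainable, lr=2e-5)   # lr is an assumed value
```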
Optionally, fine-grained image description based on the multi-modal large model requires designing a set of text prompt templates for the instruction fine-tuning stage, with the image, the second prompt template and the fine-grained descriptive text corresponding to the image serving as the input of the multi-modal large model. The text prompt template set of the instruction fine-tuning stage serves as the text prompt template set of the text decoder: at each iteration a prompt template is randomly selected from the set, and the selected prompt template, the image sequence and the fine-grained image description are spliced in a fixed order to obtain the input sequence of the text decoder. The prompt template serves two main purposes: (1) the vector sequence of the image output by the multi-modal information adaptation module is spliced with the second prompt template to obtain the first text sequence, and the fine-grained descriptive text of the image is spliced sensibly onto the end of the first text sequence, yielding a text sequence that the text decoder can understand, i.e. the second text sequence (the vector sequence output by the multi-modal information adaptation module can be regarded as another, foreign language quite different from the fine-grained descriptive text); (2) the prompt template contains rich instruction semantic information, according to which the text decoder performs the fine-grained image description text generation task. For example, a text prompt template set may consist of the following prompt templates:
"<image-embedding start token> <image embedding vector sequence> <image-embedding end token> Describe this image in detail.";
"<image-embedding start token> <image embedding vector sequence> <image-embedding end token> Please describe the important objects in the image, including detailed features such as texture, color, relationships and number.";
"<image-embedding start token> <image embedding vector sequence> <image-embedding end token> Look carefully at the image and describe its content in as much detail as possible.";
"<image-embedding start token> <image embedding vector sequence> <image-embedding end token> Please provide a sufficiently detailed text description for this image.";
"<image-embedding start token> <image embedding vector sequence> <image-embedding end token> Could you describe the content of this image for me in detail?".
The instruction semantics of these prompt templates are identical; they differ only in wording. The <image-embedding start token> and <image-embedding end token> in the prompt templates are special tokens that mark the start and the end of the image embedding vector sequence, respectively.
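A template pool of this kind could be held and sampled as in the following sketch; the `<img_begin>`/`<img_end>` strings stand in for the special start/end tokens, and the template wording paraphrases the list above:

```python
import random

PROMPT_TEMPLATES = [
    "<img_begin>{image}<img_end> Describe this image in detail.",
    "<img_begin>{image}<img_end> Please describe the important objects in the image, "
    "including detailed features such as texture, color, relationships and number.",
    "<img_begin>{image}<img_end> Look carefully at the image and describe its content "
    "in as much detail as possible.",
    "<img_begin>{image}<img_end> Please provide a sufficiently detailed text "
    "description for this image.",
    "<img_begin>{image}<img_end> Could you describe the content of this image "
    "for me in detail?",
]

def sample_prompt_template():
    # One template is drawn at random per training iteration.
    return random.choice(PROMPT_TEMPLATES)
```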
Further, in one embodiment, the target loss function is obtained as follows:
calculating the target loss function according to the prediction probability that the text decoder, given the first token sequence, the second token sequence and a third token sequence, assigns to each predicted token of the fine-grained descriptive text in the second text sequence, and according to the integer index, in a pre-constructed token vocabulary, of the fine-grained descriptive text corresponding to the image in the image dataset; wherein the first token sequence is obtained by removing the predicted token from the target token sequence, the target token sequence is obtained by tokenizing the fine-grained descriptive text corresponding to each image in the second text sequence, the second token sequence is obtained by tokenizing the second prompt template corresponding to each image, and the third token sequence is obtained from the first vector sequence corresponding to each image.
Optionally, FIG. 7 is a schematic flowchart of instruction fine-tuning of the multi-modal large model provided by the invention. Referring to FIG. 7, the fine-grained image description method takes an image, the fine-grained descriptive text corresponding to the image, and the second prompt template as the input of the multi-modal large model. The visual encoder extracts visual semantic information from the image to obtain the image embedding vector sequence, i.e. the second vector sequence. The multi-modal information adaptation module eliminates the inherent semantic gap of the image embedding vector sequence output by the visual encoder. In the instruction fine-tuning stage, a prompt template is randomly drawn from the text prompt template set and then spliced appropriately with the image embedding vector sequence and the fine-grained descriptive text to form the input of the text decoder. In this stage the text decoder adopts the next-token prediction task, and the loss function adopts the conventional cross-entropy loss.
The target loss function is calculated from the prediction probability of each predicted token of the first token sequence corresponding to the second text sequence and from the integer index of the fine-grained descriptive text corresponding to the image in the token vocabulary; the tokens of the fine-grained descriptive text addressed by these integer indices serve as the ground-truth targets of the cross-entropy loss function, from which the target loss function value is computed. The pre-constructed token vocabulary (or token dictionary) may be derived from existing tokenization algorithms.
The target token sequence $X_c = \{x_1, x_2, \ldots, x_Q\}$ is obtained by tokenizing the fine-grained descriptive text corresponding to each image in the second text sequence, where $Q$ is the length of the token sequence obtained by tokenizing the image's fine-grained descriptive text.

The first token sequence $X_{c,<i} = \{x_1, \ldots, x_{i-1}\}$ is obtained by removing the predicted token $x_i$ (and the tokens after it) from the target token sequence. The second token sequence $X_p = \{p_1, p_2, \ldots, p_N\}$ is obtained by tokenizing the second prompt template corresponding to each image in the image dataset, where $N$ is the length of the resulting token sequence. The third token sequence $X_v = \{v_1, v_2, \ldots, v_M\}$ is obtained from the first vector sequence corresponding to each image, where $M$ is its length.

From the first token sequence, the second token sequence and the third token sequence, the text decoder computes the probability of the predicted token $x_i$, i.e. the probability of the next token:

$$p_\theta\left(x_i \mid X_v,\, X_p,\, X_{c,<i}\right)$$
where $\theta$ denotes the weights of the multi-modal large model updated in the instruction fine-tuning stage. Since the multi-modal large model fine-tunes only the parameters of the multi-modal information adaptation module and of the text decoder, $\theta$ denotes the parameters/weights of those two components.
The meaning of the next-token prediction task is as follows: the text decoder's prediction probability for a given token of the fine-grained descriptive text is based on three parts only: (1) the first vector sequence of the image; (2) the second token sequence corresponding to the second prompt template; and (3) the tokens of the tokenized fine-grained descriptive text that precede the predicted token in the target token sequence, i.e. the first token sequence.
From the prediction probability $p_{\theta}^{i}$ that the text decoder assigns to each predicted token of the first token sequence corresponding to the second text sequence, and from the integer index $\mathrm{Label}_i$, in the token vocabulary, of the fine-grained descriptive text corresponding to the image in the image dataset, the cross-entropy loss of the multi-modal large model in the instruction fine-tuning stage, i.e. the target loss function, is calculated:

$$\mathrm{loss} = \mathrm{CrossEntropy}\left(p_{\theta}^{i},\, \mathrm{Label}_{i}\right)$$

The two inputs of the cross-entropy loss calculation are $p_{\theta}^{i}$ and $\mathrm{Label}_i$, respectively. The cross-entropy function CrossEntropy computes the loss value of one token of the image's fine-grained descriptive text at a time. The first input, $p_{\theta}^{i}$, is the one-dimensional probability vector produced for that token by the text decoder followed by the softmax operation; its values lie in $[0, 1]$, its dimension equals the total number of tokens in the token vocabulary, and it is named the logistic vector of the corresponding token of the fine-grained descriptive text. The second input, $\mathrm{Label}_i$, is the integer index of the token of the fine-grained descriptive text in the token vocabulary, and also serves as the token class label for the logistic values. The token vocabulary is obtained by tokenizing the second prompt template and the fine-grained descriptive text corresponding to each image.

The cross-entropy function CrossEntropy can be calculated as follows:

$$\mathrm{CrossEntropy}\left(p_{\theta}^{i},\, \mathrm{Label}_{i}\right) = \frac{1}{Q} \sum_{i=1}^{Q} -\ln\, p_{\theta}^{i}\left[\mathrm{Label}_{i}\right]$$

In this calculation, the logistic value at the position of the token class label is taken from the logistic vector, its natural logarithm (base $e$) is computed, and the sign is inverted, yielding the loss value of a single token of the fine-grained descriptive text; the loss values of all tokens of the fine-grained descriptive text are then averaged to obtain the loss function value of the next-token prediction task.
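In PyTorch, restricting the cross-entropy loss to the tokens of the fine-grained descriptive text might be sketched as follows; the shapes and the use of the `ignore_index` convention are illustrative assumptions, not the patent's reference implementation:

```python
import torch
import torch.nn.functional as F

def caption_loss(logits, labels, caption_mask):
    """logits: (L, V) decoder outputs over the vocabulary;
    labels: (L,) integer vocabulary indices of the input sequence;
    caption_mask: (L,) True only at fine-grained-description positions."""
    labels = labels.clone()
    labels[~caption_mask] = -100                 # exclude image + prompt positions
    # Shift by one so each position predicts the token that follows it;
    # F.cross_entropy averages over the unmasked (description) tokens.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
```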
With the image fine-grained description method for an instruction fine-tuning multi-modal large model provided by the invention, the multi-modal large model is applied to the downstream field of fine-grained image description, fully exploiting its understanding and generation capabilities; and the instruction fine-tuning technique is organically combined with the multi-modal large model, greatly reducing the training cost of applying the multi-modal large model to downstream tasks, realizing fine-grained description of images, and fully mining the information in images.
Further, in one embodiment, the image dataset and the fine-grained descriptive text corresponding to each image are obtained as follows:
constructing the image dataset from a public image-text pair dataset;
acquiring an instruction fine-tuning dataset based on seed instruction text data, the public image-text pair dataset and a high-performance dialogue model, wherein the seed instruction text data comprise at least one first image-text pair, each consisting of an image taken from the public image-text pair dataset and a manually written fine-grained descriptive text for that image;
and filtering and denoising the fine-grained descriptive text of the images in the instruction fine-tuning dataset with the high-performance dialogue model to obtain the fine-grained descriptive text corresponding to each image.
Optionally, FIG. 8 is a schematic flowchart of constructing fine-grained descriptive text for images provided by the invention. Referring to FIG. 8, constructing the fine-grained descriptive text corresponding to each image in the image dataset mainly comprises three steps: (1) designing the seed instruction text data; (2) constructing an instruction fine-tuning dataset for fine-grained image description based on the seed instruction text data, the public image-text pair dataset and a high-performance dialogue model (e.g. ChatGPT or GPT-4); (3) post-processing the fine-grained descriptive text in that dataset by filtering and denoising with the high-performance dialogue model, obtaining the fine-grained descriptive text corresponding to each image.
The image dataset may specifically be constructed from the images of the public image-text pair dataset. The public image-text pair dataset may comprise the currently published high-quality image-text datasets, including COCO, CC3M and the like.
Current instruction fine-tuning data are all textual instruction data. To instruction-fine-tune the multi-modal large model for fine-grained image description, a high-quality instruction fine-tuning dataset for fine-grained image description must be constructed. First, the seed instruction text data of this dataset are designed; to guarantee the quality of the seed instruction text data and of the fine-grained descriptive text generated later, the seed instruction text data are designed entirely by hand. The seed instruction text data comprise at least one first image-text pair consisting of an image taken from the public image-text pair dataset and a manually written fine-grained descriptive text for that image.
To eliminate the influence of noise as much as possible, a certain number of high-quality image-text pairs (i.e. seed image-text pairs), generally fewer than 200, are sampled from the existing public high-quality image-text datasets (such as COCO and CC3M) as the metadata of the seed instruction text data. Because the image descriptions in the existing public high-quality image-text datasets are generally short, fine-grained descriptive text is written manually on the basis of this metadata. For example, the original image description of one image-text pair in the metadata of the seed instruction text data reads: "Three puppies sit on a woman's legs, looking out of the window."
The manually written/designed fine-grained descriptive text is: "This is an image full of warmth and harmony. In the car, three attentive puppies sit comfortably on a woman's legs, all drawn to the scene outside. At the front of the car a phone can be seen lying quietly, perhaps ready for the woman to record this lovely moment. The scene is filled with a relaxed and pleasant atmosphere, a moment of pure delight." A fine-grained descriptive text is written for every image-text pair in the metadata of the seed instruction text data, and the original image description in each pair is replaced by the manually written fine-grained descriptive text, yielding the first image-text pairs and thus the manually designed seed instruction text data.
Further, in an embodiment, the obtaining the instruction trimming dataset based on the seed instruction text data, the published text-to-text dataset and the high-performance dialogue model may include:
selecting a first preset number of target image-text pairs from first image-text pairs included in the seed instruction text data as seed instruction text examples;
Splicing the seed instruction text example, a target text and a construction instruction to obtain a third text sequence, wherein the target text is obtained by converting size information of a second picture-text centering image, an original description text corresponding to the second picture-text centering image and detection frames of all targets in the second picture-text centering image into text forms, the second picture-text pairs are second preset number of picture-text pairs extracted from the public picture-text pair data set, the original description text is the description text of the second picture-text pairs in the public picture-text pair data set, the construction instruction is used for providing target instruction information for the high-performance dialogue model, and the target instruction information is instruction information required for carrying out fine-granularity description on the images in the image data set according to the seed instruction text example and the target text;
And inputting the third text sequence into the high-performance dialogue model to acquire the instruction fine-tuning data set.
Optionally, the instruction fine-tuning dataset for fine-grained image description is built from the seed instruction text data, an existing public image-text pair dataset, and a high-performance dialogue model (e.g., ChatGPT or GPT-4). COCO can be chosen as the public image-text pair dataset; a second preset number (for example, 50,000) of image-text pairs (i.e., the second image-text pairs) are extracted from it, and the images, the detection boxes of all targets in each image (such as rectangular box annotations), and the size information of each image (for example, width and height) serve as the metadata of the instruction fine-tuning dataset for fine-grained image description. The seed instruction text data and this metadata are then fed to the high-performance dialogue model, whose inherent instruction-following capability is used to obtain fine-grained descriptive text for the images.
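Step (2) below converts this metadata into the text-form target text. The following sketch shows one possible line-based serialization; the exact format is an assumption, since the embodiment only requires that the size information, original description and detection boxes be expressed as text.

```python
def to_target_text(width, height, caption, boxes):
    """Serialize one second image-text pair's metadata into the target text.

    boxes: list of (label, x1, y1, x2, y2) rectangular box annotations.
    """
    lines = [
        f"Image size: width={width}, height={height}",
        f"Original description: {caption}",
        "Detected targets:",
    ]
    for label, x1, y1, x2, y2 in boxes:
        lines.append(f"- {label}: [{x1}, {y1}, {x2}, {y2}]")
    return "\n".join(lines)
```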
Specifically:
(1) A first preset number of target image-text pairs are selected from the first image-text pairs included in the seed instruction text data as input examples/prompts, and are named the seed instruction text examples.
(2) A second preset number of image-text pairs are selected from the public image-text pair dataset to obtain the second image-text pairs, and the original description text of each image (i.e., the description text of the second image-text pair in the public dataset), the rectangular box annotations of all targets in the image, and the size information of the image are converted into text form; this text-form metadata of the image is named the target text.
(3) The seed instruction text examples and the target text are spliced together as input to the high-performance dialogue model.
(4) An instruction text for constructing the instruction fine-tuning dataset is selected and named the construction instruction; its function is to make the high-performance dialogue model generate fine-grained descriptive text for the image according to the seed instruction text examples and the target text.
(5) The long text obtained by splicing the seed instruction text examples and the target text is spliced with the construction instruction in a reasonable manner to obtain the third text sequence, which serves as the input of the high-performance dialogue model. In general, this input comprises three parts: the seed instruction text examples, the target text, and the construction instruction. Using its inherent instruction-following capability and the input third text sequence, the model generates fine-grained descriptive text for the image.
(6) The above steps are iterated until construction of the instruction fine-tuning dataset is complete.
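A sketch of steps (3) to (5) follows: the seed examples and the target text are spliced into one long few-shot text, the construction instruction is spliced on, and the result is submitted to the dialogue model. Here `chat` stands in for whatever high-performance dialogue model API (e.g., ChatGPT or GPT-4) is actually used; its name, signature and the prompt layout are assumptions.

```python
def build_third_text_sequence(seed_examples, target_text, construction_instruction):
    # splice the seed instruction text examples and the target text into one
    # long text, then splice the construction instruction onto it
    example_block = "\n\n".join(
        f"Input:\n{ex['metadata_text']}\nOutput:\n{ex['fine_grained_text']}"
        for ex in seed_examples
    )
    long_text = f"{example_block}\n\nInput:\n{target_text}\nOutput:"
    return f"{construction_instruction}\n\n{long_text}"

def generate_description(chat, seed_examples, target_text, construction_instruction):
    prompt = build_third_text_sequence(seed_examples, target_text,
                                       construction_instruction)
    return chat(prompt)  # the reply is the fine-grained descriptive text
```

Each returned description is appended to the instruction fine-tuning dataset, and the loop repeats over the remaining second image-text pairs (step (6)).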
Further, in an embodiment, filtering and denoising the fine-grained descriptive text of the images in the instruction fine-tuning dataset by using the high-performance dialogue model to obtain the fine-grained descriptive text corresponding to each image may include:
splicing the fine-grained descriptive text of the images in the instruction fine-tuning dataset with a post-processing instruction and inputting the result into the high-performance dialogue model to obtain the fine-grained descriptive text corresponding to each image, wherein the post-processing instruction is used for guiding the high-performance dialogue model to filter and denoise the fine-grained descriptive text of the images in the instruction fine-tuning dataset.
Optionally, the fine-grained descriptive text of the images in the instruction fine-tuning dataset is filtered and denoised with the high-performance dialogue model to obtain the fine-grained descriptive text corresponding to each image. This step requires an instruction text to guide the high-performance dialogue model through the filtering and denoising post-processing; the corresponding instruction text is named the post-processing instruction. In each iteration, a certain amount of fine-grained descriptive text is extracted from the instruction fine-tuning dataset, the post-processing instruction is spliced with it in a reasonable manner, and the result is fed to the high-performance dialogue model, whose output is the denoised fine-grained descriptive text.
The post-processing instruction may use the following text template: "Repair the errors in the given paragraph. Delete any duplicate sentences, meaningless characters, non-Chinese sentences, and the like. Remove unnecessary repetition. Rewrite any incomplete sentences. Return the result directly without explanation. If the input paragraph is already correct, return it directly without explanation."
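The following sketch applies this post-processing pass: in each iteration a batch of generated descriptions is extracted, the post-processing instruction is spliced onto each one, and the dialogue model returns the denoised text. The `chat` callable and the English rendering of the template are assumptions.

```python
POST_PROCESS_INSTRUCTION = (
    "Repair the errors in the given paragraph. Delete any duplicate "
    "sentences, meaningless characters and the like. Remove unnecessary "
    "repetition. Rewrite any incomplete sentences. Return the result "
    "directly without explanation; if the input paragraph is already "
    "correct, return it directly without explanation."
)

def denoise_descriptions(chat, descriptions):
    """Splice the post-processing instruction with each fine-grained
    description and collect the model's filtered, denoised output."""
    cleaned = []
    for text in descriptions:
        prompt = f"{POST_PROCESS_INSTRUCTION}\n\n{text}"
        cleaned.append(chat(prompt))
    return cleaned
```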
FIG. 9 is a second flow chart of the fine-grained image description method for an instruction fine-tuned multi-modal large model provided by the present invention. Referring to FIG. 9, the method includes:
constructing the overall architecture of the multi-modal large model, which mainly comprises a visual encoder, a multi-modal information adaptation module and a text decoder;
designing the seed instruction text data and constructing, based on the high-performance dialogue model, the instruction fine-tuning dataset for fine-grained image description, i.e., the fine-grained descriptive text of the images;
performing instruction fine-tuning on the multi-modal large model using the images, the fine-grained descriptive text and the prompt template, as sketched below.
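A minimal PyTorch-style sketch of one fine-tuning step follows, assuming a frozen visual encoder, a trainable adaptation module and text decoder, and a decoder that accepts input embeddings and returns logits; the module names, the `embed` helper and the shapes are all assumptions. The first vector sequence is spliced with the embedded prompt template, the description tokens are appended, and a next-token cross-entropy loss is minimized over the description positions only.

```python
import torch
import torch.nn.functional as F

def fine_tune_step(visual_encoder, adapter, text_decoder, optimizer,
                   image, prompt_ids, target_ids):
    with torch.no_grad():
        semantic = visual_encoder(image)           # second vector sequence
    visual_seq = adapter(semantic)                 # first vector sequence
    prompt_emb = text_decoder.embed(prompt_ids)    # embedded prompt template
    target_emb = text_decoder.embed(target_ids)    # embedded description
    inputs = torch.cat([visual_seq, prompt_emb, target_emb], dim=1)
    logits = text_decoder(inputs_embeds=inputs)    # (B, T, vocab), assumed
    # predictions for the description tokens sit one position earlier
    n_ctx = visual_seq.size(1) + prompt_emb.size(1)
    pred = logits[:, n_ctx - 1 : -1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Iterating this step over the images, their fine-grained descriptive texts and the prompt template fine-tunes the adaptation module and text decoder until the target loss converges.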
According to the fine-grained image description method for an instruction fine-tuned multi-modal large model provided by the present invention, constructing high-quality fine-grained descriptive text speeds up the convergence of instruction fine-tuning of the multi-modal large model, so that researchers can fine-tune multi-modal large models more efficiently and the models can understand the semantic information of each modality more accurately, allowing the method to be applied in more fields.
The fine-grained image description system for an instruction fine-tuned multi-modal large model provided by the present invention is described below; the system described below and the method described above may be referred to in correspondence with each other.
FIG. 10 is a schematic structural diagram of the fine-grained image description system for an instruction fine-tuned multi-modal large model. As shown in FIG. 10, the system comprises:
A data acquisition module 1010, configured to acquire a first vector sequence corresponding to a target image, where the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained according to high-level semantic information extracted from the target image;
The image description module 1011 is configured to obtain fine-granularity description text of the target image according to the first vector sequence and a first hint template, where the first hint template is used to provide instruction information required for fine-granularity description of the target image.
Based on the instruction information required for fine-grained description of an image and a vector sequence carrying the image's high-level semantic information, the fine-grained image description system for an instruction fine-tuned multi-modal large model provided by the present invention can accurately identify and describe the attributes and characteristics of important targets in the image, realizing fine-grained description and providing richer, more specific image information.
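The system's inference path can be sketched as follows: the visual encoder extracts high-level semantics (the second vector sequence), the multi-modal information adaptation module encodes it into the first vector sequence, and the text decoder generates the description conditioned on that sequence and the first prompt template. The `generate` method and its keyword arguments are assumptions about the decoder's interface.

```python
def describe_image(visual_encoder, adapter, text_decoder, image, prompt_ids):
    second_seq = visual_encoder(image)   # high-level semantics (module 1010)
    first_seq = adapter(second_seq)      # encoded first vector sequence
    # the decoder produces the fine-grained descriptive text (module 1011)
    return text_decoder.generate(visual_prefix=first_seq, prompt_ids=prompt_ids)
```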
Fig. 11 is a schematic diagram of the physical structure of an electronic device according to the present invention. As shown in Fig. 11, the electronic device may include a processor 1110, a communication interface 1111, a memory 1112 and a bus 1113, where the processor 1110, the communication interface 1111 and the memory 1112 communicate with one another via the bus 1113. The processor 1110 may call logic instructions in the memory 1112 to perform the following method:
Acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained according to high-level semantic information extracted from the target image;
and acquiring a fine granularity description text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used for providing instruction information required for carrying out fine granularity description on the target image.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Further, the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the fine-grained image description method for an instruction fine-tuned multi-modal large model provided by the above method embodiments, for example comprising:
Acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained according to high-level semantic information extracted from the target image;
and acquiring a fine granularity description text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used for providing instruction information required for carrying out fine granularity description on the target image.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the fine-grained image description method for an instruction fine-tuned multi-modal large model provided in the above embodiments, for example comprising:
Acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained according to high-level semantic information extracted from the target image;
and acquiring a fine granularity description text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used for providing instruction information required for carrying out fine granularity description on the target image.
The system embodiments described above are merely illustrative; units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the method described in the various embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A fine-grained image description method for an instruction fine-tuned multi-modal large model, characterized by comprising the following steps:
Acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained according to high-level semantic information extracted from the target image;
Acquiring a fine granularity description text of the target image according to the first vector sequence and a first prompt template, wherein the first prompt template is used for providing instruction information required for carrying out fine granularity description on the target image;
the acquiring the fine granularity descriptive text of the target image according to the first vector sequence and the first prompt template comprises the following steps:
Inputting the first vector sequence and the first prompt template to a target text decoder to obtain the fine-granularity descriptive text, wherein the target text decoder is used for generating the fine-granularity descriptive text corresponding to the target image according to the first prompt template, the first vector sequence corresponding to the target image is obtained by inputting the second vector sequence corresponding to the target image to a target multi-modal information adaptation module in a multi-modal large model, and the target multi-modal information adaptation module is used for encoding the second vector sequence corresponding to the target image to obtain the first vector sequence corresponding to the target image;
wherein the fine-grained descriptive text corresponding to each image in the image dataset is acquired by:
constructing an image dataset according to the public image-text pair dataset;
acquiring an instruction fine-tuning dataset based on seed instruction text data, the public image-text pair dataset and a high-performance dialogue model, wherein the seed instruction text data comprise at least one first image-text pair, and the first image-text pair consists of an image acquired from the public image-text pair dataset and a manually written fine-grained descriptive text corresponding to the image;
Filtering and denoising the fine-granularity descriptive text of the images in the instruction fine-tuning dataset by using the high-performance dialogue model to obtain fine-granularity descriptive text corresponding to each image;
wherein the acquiring the instruction fine-tuning dataset based on the seed instruction text data, the public image-text pair dataset and the high-performance dialogue model comprises:
selecting a first preset number of target image-text pairs from first image-text pairs included in the seed instruction text data as seed instruction text examples;
splicing the seed instruction text example, a target text and a construction instruction to obtain a third text sequence, wherein the target text is obtained by converting into text form the size information of the image in a second image-text pair, the original description text corresponding to the image in the second image-text pair, and the detection boxes of all targets in the image in the second image-text pair; the second image-text pairs are a second preset number of image-text pairs extracted from the public image-text pair dataset; the original description text is the description text of the second image-text pair in the public image-text pair dataset; the construction instruction is used for providing target instruction information to the high-performance dialogue model, the target instruction information being the instruction information required for carrying out fine-grained description of the images in the image dataset according to the seed instruction text example and the target text;
and inputting the third text sequence into the high-performance dialogue model to obtain the instruction fine-tuning dataset.
2. The fine-grained image description method for an instruction fine-tuned multi-modal large model according to claim 1, wherein the target text decoder and the target multi-modal information adaptation module are obtained by:
Splicing a first vector sequence corresponding to each image in an image data set and a second prompt template to obtain a first text sequence, wherein the second prompt template is used for providing instruction information required for carrying out fine-granularity description on the images in the image data set, and the text decoder is used for generating fine-granularity description text corresponding to each image according to the second prompt template;
splicing the fine-granularity description text corresponding to each image in the image data set to the back of the first text sequence to obtain a second text sequence;
inputting the second text sequence to a text decoder in the multi-modal large model, and fine-tuning a multi-modal information adaptation module in the multi-modal large model and the text decoder until a target loss function converges;
Taking the converged multi-modal information adaptation module as the target multi-modal information adaptation module;
And taking the converged text decoder as the target text decoder.
3. The fine-grained image description method for an instruction fine-tuned multi-modal large model according to claim 2, wherein the target loss function is calculated by:
calculating the target loss function according to the prediction probabilities of the predicted words in a first word segmentation sequence corresponding to the second text sequence and the integer indices, in a pre-constructed word segmentation vocabulary, of the fine-grained descriptive text corresponding to the images in the image dataset, wherein the first word segmentation sequence is obtained according to a target word segmentation sequence, a second word segmentation sequence and a third word segmentation sequence; the prediction probabilities are obtained after the predicted words are removed from the target word segmentation sequence; the target word segmentation sequence is obtained by segmenting the fine-grained descriptive text corresponding to each image in the second text sequence; the second word segmentation sequence is obtained by segmenting the second prompt template corresponding to each image; and the third word segmentation sequence is obtained by segmenting the first vector sequence corresponding to each image.
4. The fine-grained image description method for an instruction fine-tuned multi-modal large model according to claim 1, wherein filtering and denoising the fine-grained descriptive text of the images in the instruction fine-tuning dataset by using the high-performance dialogue model to obtain the fine-grained descriptive text corresponding to each image comprises:
splicing the fine-grained descriptive text of the images in the instruction fine-tuning dataset with a post-processing instruction and inputting the result into the high-performance dialogue model to obtain the fine-grained descriptive text corresponding to each image, wherein the post-processing instruction is used for guiding the high-performance dialogue model to filter and denoise the fine-grained descriptive text of the images in the instruction fine-tuning dataset.
5. The fine-grained image description method for an instruction fine-tuned multi-modal large model according to any one of claims 1 to 3, wherein the second vector sequence corresponding to the target image is acquired by:
inputting the target image to a visual encoder in the multi-modal large model to acquire the second vector sequence corresponding to the target image, wherein the visual encoder is used for extracting the high-level semantic information in the target image and obtaining the second vector sequence corresponding to the target image.
6. A fine-grained image description system for an instruction fine-tuned multi-modal large model, characterized by comprising:
The data acquisition module is used for acquiring a first vector sequence corresponding to a target image, wherein the first vector sequence corresponding to the target image is obtained by encoding a second vector sequence corresponding to the target image, and the second vector sequence corresponding to the target image is obtained according to high-level semantic information extracted from the target image;
The image description module is used for acquiring fine-granularity description text of the target image according to the first vector sequence and a first prompt template, and the first prompt template is used for providing instruction information required for fine-granularity description of the target image;
the image description module is specifically used for:
the acquiring the fine granularity descriptive text of the target image according to the first vector sequence and the first prompt template comprises the following steps:
Inputting the first vector sequence and the first prompt template to a target text decoder to obtain the fine-granularity descriptive text, wherein the target text decoder is used for generating the fine-granularity descriptive text corresponding to the target image according to the first prompt template, the first vector sequence corresponding to the target image is obtained by inputting the second vector sequence corresponding to the target image to a target multi-modal information adaptation module in a multi-modal large model, and the target multi-modal information adaptation module is used for encoding the second vector sequence corresponding to the target image to obtain the first vector sequence corresponding to the target image;
the image description module is specifically further configured to:
constructing an image dataset according to the images in the public image-text pair dataset;
acquiring an instruction fine-tuning dataset based on seed instruction text data, the public image-text pair dataset and a high-performance dialogue model, wherein the seed instruction text data comprise at least one first image-text pair, and the first image-text pair consists of an image acquired from the public image-text pair dataset and a manually written fine-grained descriptive text corresponding to the image;
Filtering and denoising the fine-granularity descriptive text of the images in the instruction fine-tuning dataset by using the high-performance dialogue model to obtain fine-granularity descriptive text corresponding to each image;
the image description module is specifically further configured to:
selecting a first preset number of target image-text pairs from first image-text pairs included in the seed instruction text data as seed instruction text examples;
splicing the seed instruction text example, a target text and a construction instruction to obtain a third text sequence, wherein the target text is obtained by converting into text form the size information of the image in a second image-text pair, the original description text corresponding to the image in the second image-text pair, and the detection boxes of all targets in the image in the second image-text pair; the second image-text pairs are a second preset number of image-text pairs extracted from the public image-text pair dataset; the original description text is the description text of the second image-text pair in the public image-text pair dataset; the construction instruction is used for providing target instruction information to the high-performance dialogue model, the target instruction information being the instruction information required for carrying out fine-grained description of the images in the image dataset according to the seed instruction text example and the target text;
and inputting the third text sequence into the high-performance dialogue model to obtain the instruction fine-tuning dataset.
7. An electronic device comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the fine-grained image description method for an instruction fine-tuned multi-modal large model of any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the fine-grained image description method for an instruction fine-tuned multi-modal large model of any one of claims 1 to 5.
CN202311273241.9A 2023-09-28 2023-09-28 Image fine granularity description method and system for instruction fine adjustment multi-mode large model Active CN117423108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311273241.9A CN117423108B (en) 2023-09-28 2023-09-28 Image fine granularity description method and system for instruction fine adjustment multi-mode large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311273241.9A CN117423108B (en) 2023-09-28 2023-09-28 Image fine granularity description method and system for instruction fine adjustment multi-mode large model

Publications (2)

Publication Number Publication Date
CN117423108A CN117423108A (en) 2024-01-19
CN117423108B true CN117423108B (en) 2024-05-24

Family

ID=89525501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311273241.9A Active CN117423108B (en) 2023-09-28 2023-09-28 Image fine granularity description method and system for instruction fine adjustment multi-mode large model

Country Status (1)

Country Link
CN (1) CN117423108B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230205994A1 (en) * 2021-12-23 2023-06-29 Google Llc Performing machine learning tasks using instruction-tuned neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737511A (en) * 2020-06-17 2020-10-02 南强智视(厦门)科技有限公司 Image description method based on self-adaptive local concept embedding
CN116127080A (en) * 2021-11-11 2023-05-16 腾讯科技(深圳)有限公司 Method for extracting attribute value of description object and related equipment
CN115082693A (en) * 2022-07-18 2022-09-20 湖南大学 Multi-granularity multi-mode fused artwork image description generation method
CN116343185A (en) * 2022-12-05 2023-06-27 北京化工大学 Sign semantic information extraction method oriented to blind assisting field
CN115953590A (en) * 2022-12-12 2023-04-11 之江实验室 Segmented fine-grained commodity image description generation method, device and medium
CN116364055A (en) * 2023-05-31 2023-06-30 中国科学院自动化研究所 Speech generation method, device, equipment and medium based on pre-training language model
CN116778438A (en) * 2023-08-17 2023-09-19 苏州市吴江区盛泽镇人民政府 Illegal forklift detection method and system based on large language model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Visual Instruction Tuning; Haotian Liu et al.; 37th Conference on Neural Information Processing Systems (NeurIPS 2023); 2023-09-22; full text *
Research on Generating Image Descriptions Based on Multi-modal Deep Neural Networks; Zhou Shan; Liu Zilong; Software Guide (软件导刊); 2018-07-17 (08); full text *
Enriching Image Descriptions via Fine-Grained Semantic Features and Transformer; Wang Junhao; Luo Yifeng; Journal of East China Normal University (Natural Science Edition); 2020-09-25 (05); full text *

Also Published As

Publication number Publication date
CN117423108A (en) 2024-01-19

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN111079532B (en) Video content description method based on text self-encoder
CN111090736B (en) Question-answering model training method, question-answering method, device and computer storage medium
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN111612103A (en) Image description generation method, system and medium combined with abstract semantic representation
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN113837229B (en) Knowledge-driven text-to-image generation method
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN114528898A (en) Scene graph modification based on natural language commands
CN110232564A (en) A kind of traffic accident law automatic decision method based on multi-modal data
CN111402365A (en) Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN110188772A (en) Chinese Image Description Methods based on deep learning
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
KR20210125275A (en) Deep Learning based Document Summarization Method and System
CN116051388A (en) Automatic photo editing via language request
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN116363261A (en) Training method of image editing model, image editing method and device
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
CN117475038A (en) Image generation method, device, equipment and computer readable storage medium
CN114638228A (en) Chinese named entity recognition method based on word set self-attention
CN112528989B (en) Description generation method for semantic fine granularity of image
CN117423108B (en) Image fine granularity description method and system for instruction fine adjustment multi-mode large model
CN116975347A (en) Image generation model training method and related device
CN115810215A (en) Face image generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant