CN114743018B - Image description generation method, device, equipment and medium - Google Patents

Image description generation method, device, equipment and medium

Info

Publication number
CN114743018B
CN114743018B CN202210423256.8A
Authority
CN
China
Prior art keywords
image
preset
attention
detected
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210423256.8A
Other languages
Chinese (zh)
Other versions
CN114743018A (en)
Inventor
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210423256.8A priority Critical patent/CN114743018B/en
Publication of CN114743018A publication Critical patent/CN114743018A/en
Application granted granted Critical
Publication of CN114743018B publication Critical patent/CN114743018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides an image description generation method, device, equipment and medium. The method comprises the following steps: inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected; inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected; inputting the regional characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder; and inputting the output value and the category embedding into a decoder of the preset transformation model for processing, and generating a description text of the image to be detected. The invention also relates to the technical field of blockchain, and the regional characteristics and the category embedding can be stored in a node of a blockchain.

Description

Image description generation method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for generating an image description.
Background
Image description (Image Captioning) is a comprehensive emerging discipline that integrates computer vision technology, natural language processing technology and machine learning technology. The purpose of image description is to automatically generate a piece of descriptive text according to the content of a picture.
As the Transformer model has become popular in the NLP field, many Transformer-based image description methods have emerged in succession and exhibit better performance than most conventional methods. Compared with the Transformer model used for natural language processing, these methods improve the input position coding and the attention mechanism module of the encoder section so that the model better adapts to image input.
However, the current methods cannot integrate abstract features, such as the relationships between image targets and the mapping relationship between targets and their corresponding labels, into the attention mechanism, so the obtained description information is not accurate and rich enough.
Disclosure of Invention
In view of the above, the present invention provides a method, apparatus, device and medium for generating image descriptions, which aims to solve the technical problem in the prior art that the description information generated for an image is not accurate and rich enough.
To achieve the above object, the present invention provides an image description generation method, including:
Inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
Inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected;
Inputting the regional characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
And inputting the output value and the category embedding into a decoder of the preset transformation model for processing, and generating a description text of the image to be detected.
Preferably, the inputting the image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected includes:
According to a preset geometric relation calculation formula, carrying out frame recognition on targets contained in the image to be detected to obtain frames of the targets and target categories of each frame;
And adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
Preferably, the preset geometric relation calculation formula includes:
Wherein ξ(a, b) is the regional characteristic of the image to be detected, (x_a, y_a) is the center point coordinate of the a-th frame of the image to be detected, (x_b, y_b) is the center point coordinate of the b-th frame of the image to be detected, (w_a, h_a) is the width and height of the a-th frame, and (w_b, h_b) is the width and height of the b-th frame.
Preferably, the inputting the region feature into a preset tag attention model for weight calculation, outputting the category embedding of the image to be detected, includes:
Matching the target category of the image to be detected with a preset word of a preset multidimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category;
And carrying out coding embedding on the predicted word according to a preset first attention formula to obtain category embedding of the image to be detected.
Preferably, the preset tag attention model includes a plurality of attention modules, each of the attention modules includes an independent scaled dot-product attention function, and the encoding and embedding of the predicted word according to the preset first attention formula to obtain the category embedding of the image to be detected includes:
a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module;
a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module;
A3, repeating the steps A1-A2 to obtain weight values of all attention modules, splicing all weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
Preferably, the encoder includes a plurality of identical encoding layers, each encoding layer includes a multi-head self-attention sub-layer and a position feedforward sub-layer, the multi-head self-attention sub-layer includes a plurality of parallel head modules, and the inputting of the regional characteristics into the encoder of the preset transformation model for processing and outputting the output value of the encoder includes:
B1, inputting geometric features of the region features into a matrix of a first parallel head module in a first coding layer according to a preset second attention calculation formula to perform weight calculation, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
B3, repeating the steps B1-B2 to obtain the result values of all parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result into the position feedforward sub-layer for nonlinear transformation, and inputting the output into a second coding layer of the encoder;
B4, repeating the B1-B3 to output the output values of all the coding layers.
Preferably, the decoder includes a plurality of identical decoding layers, each decoding layer includes a masking multi-head self-attention sub-layer, a multi-head cross-attention sub-layer and a position forward sub-layer, and the inputting of the output value and the category embedding into the decoder of the preset transformation model for processing to generate the description text of the image to be detected includes:
Performing position embedding on the output value of the last coding layer and using it as the input of the masking multi-head self-attention sub-layer to obtain an input word vector;
Inputting each output value, the input word vector and the category embedding of the target into the multi-head cross-attention sub-layer for cross-attention calculation to obtain a weight matrix;
And inputting the weight matrix into the position forward sub-layer for calculation to generate a plurality of keywords, and splicing all the keywords to generate the description text of the image to be detected.
In order to achieve the above object, the present invention also provides an image description generation apparatus, the apparatus comprising:
An identification module: used for inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
A calculation module: used for inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected;
An output module: used for inputting the regional characteristics into an encoder of a preset transformation model for processing, and outputting the output value of the encoder;
A generation module: used for inputting the output value and the category embedding into a decoder of the preset transformation model for processing, and generating the description text of the image to be detected.
To achieve the above object, the present invention also provides an electronic device including:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a program executable by the at least one processor to enable the at least one processor to perform the image description generation method according to any one of claims 1 to 7.
To achieve the above object, the present invention also provides a computer readable medium storing an image description generation program which, when executed by a processor, implements the steps of the image description generation method according to any one of claims 1 to 7.
The invention consists of three models, namely a preset target detection model, a preset tag attention model and a Transformer model (the preset transformation model). The image to be detected is identified and classified according to the preset target detection model; in combination with the recognition frames obtained during recognition, a geometric relationship and a position relationship can be established between any two targets, and the categories and the regional features of the targets of the image to be detected are output.
According to the preset tag attention model and a preset multidimensional dictionary, targets that appear more frequently in the regional features are given important category embeddings as new tags; the category embedding and the output value of the encoder of the preset transformation model are input into the decoding stage of the preset transformation model to generate the description information of the image to be detected. This improves the accuracy of the relationships between targets in the image description information, so that the description content is richer.
Drawings
FIG. 1 is a flow chart diagram of a preferred embodiment of the image description generation method of the present invention;
FIG. 2 is a block diagram of an image description generating apparatus according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of an electronic device according to a preferred embodiment of the present invention;
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention can acquire and process the related data based on the artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The invention provides an image description generation method. Referring to fig. 1, a method flow diagram of an embodiment of an image description generating method according to the present invention is shown. The method may be performed by an electronic device, which may be implemented in software and/or hardware. The image description generation method comprises the following steps S10-S40:
step S10: inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected.
The specific step S10 includes:
According to a preset geometric relation calculation formula, carrying out frame recognition on targets contained in the image to be detected to obtain frames of the targets and target categories of each frame;
And adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
In this embodiment, the preset target detection model includes, but is not limited to, the Faster R-CNN target detection model, which integrates feature extraction, frame regression and classification. The MS COCO dataset is taken as the category database of the preset target detection model, and the preset target detection model performs target detection on the input image to be detected, which specifically comprises the following steps: first, the Conv layers of the preset target detection model extract image features from the image; the image features are shared by the RPN layer and the fully connected layer for calculation according to the preset geometric relation calculation formula, and frame recognition is performed to obtain the frames of the targets and the target category of each frame, where the target categories comprise foreground information and background information (for example, the foreground information is a target object of the image).
In the frame recognition process, in order to better cover the characteristics of the image to be detected, each frame is encoded by four coordinate parameters (x, y, w, h) that represent the position information of the anchor point and the real frame, where the four coordinate parameters respectively represent the center point coordinates, the width and the height of the target frame; four scalar offsets are learned through linear regression so that the anchor point continuously approaches the real frame, and the frame of the target in the image to be detected is obtained accurately.
In order to facilitate the generation of the descriptive text, bilinear interpolation is applied to the image feature mapping region corresponding to each target, the size of the frame is adjusted to a preset range (for example, to the edge pixel part of the target in the image), and finally the regional features of the input image are obtained. In order to take the geometric relationship and the position relationship between targets into account when generating the description text, the geometric relationship and the position relationship between any two targets can be established based on the frames obtained in target detection; the geometric relationship represents the relationship between different targets in the image to be detected, and the position relationship represents the position of a target in the image to be detected.
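As an illustrative sketch only (not a definitive implementation of this patent), the following Python snippet shows how a pretrained Faster R-CNN from torchvision could obtain the frames and target categories and convert them into the (center x, center y, width, height) parameters described above; the library choice, the score threshold and the helper name detect_regions are assumptions.

import torch
import torchvision

# Hypothetical stand-in for the "preset target detection model".
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_regions(image_tensor, score_thresh=0.5):
    # image_tensor: float tensor of shape (3, H, W), values scaled to [0, 1].
    with torch.no_grad():
        output = model([image_tensor])[0]
    keep = output["scores"] > score_thresh
    boxes = output["boxes"][keep]    # (N, 4) corner format (x1, y1, x2, y2)
    labels = output["labels"][keep]  # (N,) MS COCO category indices
    # Convert corners to the four coordinate parameters (center x, center y, width, height).
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    return torch.stack([cx, cy, w, h], dim=1), labels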
In one embodiment, the preset geometrical relationship calculation formula includes:
Wherein ξ(a, b) is the regional characteristic of the image to be detected, (x_a, y_a) is the center point coordinate of the a-th frame of the image to be detected, (x_b, y_b) is the center point coordinate of the b-th frame of the image to be detected, (w_a, h_a) is the width and height of the a-th frame, and (w_b, h_b) is the width and height of the b-th frame.
The geometric relationship and the position relationship between two targets of the image to be detected can be obtained through ξ(a, b), and the geometric feature η_G between targets in different regions can be obtained through transformation.
The geometric feature η_G comprises: η_G = Emb(ξ(a, b)) · w_G, where Emb embeds the geometric features between targets, mapping the relation vector ξ(a, b) between targets to a higher dimension, and w_G is a learnable vector that projects the embedded vector to a scalar.
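A minimal sketch of this step is given below, assuming the commonly used relative-geometry form for ξ(a, b) (the exact formula is not reproduced in the text above) and a small learnable module standing in for Emb and w_G; all module and dimension names are assumptions.

import torch
import torch.nn as nn

def box_relation(box_a, box_b, eps=1e-6):
    # box_*: tensors (cx, cy, w, h); returns an assumed 4-dim relation vector ξ(a, b).
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return torch.stack([
        torch.log(torch.abs(xa - xb) / wa + eps),
        torch.log(torch.abs(ya - yb) / ha + eps),
        torch.log(wb / wa),
        torch.log(hb / ha),
    ])

class GeometryEmbedding(nn.Module):
    # Maps ξ(a, b) to a higher dimension (Emb) and projects it to a scalar with w_G.
    def __init__(self, dim=64):
        super().__init__()
        self.emb = nn.Linear(4, dim)
        self.w_g = nn.Linear(dim, 1, bias=False)

    def forward(self, rel_vec):
        return torch.relu(self.w_g(torch.relu(self.emb(rel_vec)))).squeeze(-1)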
Step S20: and inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected.
The specific step S20 includes:
Matching the target category of the image to be detected with a preset word of a preset multidimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category;
And carrying out coding embedding on the predicted word according to a preset first attention formula to obtain category embedding of the image to be detected.
In one embodiment, the preset tag attention model includes a plurality of attention modules, each of the attention modules includes an independent scaled dot-product attention function, and the encoding and embedding of the predicted word according to the preset first attention formula to obtain the category embedding of the image to be detected includes:
a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module;
a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module;
A3, repeating the steps A1-A2 to obtain weight values of all attention modules, splicing all weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
In one embodiment, the preset matching formula includes:
L_i = Emb(D(w_j)), when C_i == D(w_j)
Wherein L_i is the i-th target label of the image to be detected, C_i is the i-th target category of the image to be detected, and D(w_j) is the j-th preset word of the preset multidimensional dictionary matching the i-th target category of the image to be detected.
In one embodiment, to give more weight to more important and more frequently occurring target categories, the ranking of the i-th target label among all the detection targets is calculated on the basis of L_i, specifically: R_i = L_i · Pr(C_i), where Pr(C_i) is the probability of the target category corresponding to the i-th target label among all categories.
In one embodiment, the preset first attention calculation formula includes:
L_Att = σ(MHA(L, R_i, L))
Wherein L_Att is the category embedding of the image to be detected, σ is the sigmoid activation function, MHA is the multi-head attention operation, L is the regional feature, and R_i is the ranking of the i-th target label among all detection targets.
In one embodiment, the calculation formula for each of the scaled dot product attention functions includes
Q_i = W_q Q, V_i = W_v V, K_i = W_k K
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d) V_i
MHA(Q, K, V) = Concat(head_1, …, head_h) W_O
Wherein d is the dimension of the low-dimensional vector input for the image to be detected, Q, K and V are respectively the query, key and value matrices of the preset tag attention model, Concat is the series splicing function, head_1, …, head_h are the h attention modules of the preset tag attention model, W_O is a weight value, and Attention is the attention function.
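The following is a minimal, non-authoritative sketch of this label attention step in PyTorch, assuming matched dictionary indices, per-category probabilities Pr(C_i) and regional features as inputs; the module name LabelAttention, the dimensions and the use of nn.MultiheadAttention are implementation assumptions.

import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    def __init__(self, vocab_size, dim=512, heads=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)   # Emb(D(w_j)) for matched predicted words
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, word_ids, class_probs, region_feats):
        # word_ids: (B, T) matched dictionary indices; class_probs: (B, T) Pr(C_i);
        # region_feats: (B, T, dim) regional features L.
        labels = self.emb(word_ids)                    # target labels L_i
        ranked = labels * class_probs.unsqueeze(-1)    # ranking R_i = L_i * Pr(C_i)
        att, _ = self.mha(region_feats, ranked, region_feats)  # MHA(L, R_i, L)
        return torch.sigmoid(att)                      # L_Att = σ(MHA(L, R_i, L))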
In one embodiment, before the step S20, the method further includes:
acquiring a plurality of preset words stored in advance corresponding to different images of a preset corpus;
And calculating word frequency values of all preset words appearing in the preset corpus, and constructing the preset multidimensional dictionary according to the preset words with the word frequency values larger than the preset values.
The corpus of the MS COCO dataset is composed of a large number of image descriptions (descriptive texts), where each image may correspond to a plurality of image descriptions. The image descriptions are composed of a plurality of preset words; the preset words whose number of occurrences in all the image descriptions is larger than a preset value (for example, 5) are used to construct the preset multidimensional dictionary, which serves as the reference for generating target labels. Building the preset multidimensional dictionary from the words that occur more than 5 times makes the generated image descriptions closer to human-written descriptions.
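A simple sketch of how such a dictionary could be built from the training captions is shown below; the function name, the tokenization and the reserved index 0 are assumptions, with the threshold of 5 taken from the example above.

from collections import Counter

def build_dictionary(captions, min_count=5):
    # captions: iterable of caption strings from the training corpus.
    counter = Counter()
    for caption in captions:
        counter.update(caption.lower().split())
    words = [w for w, c in counter.items() if c > min_count]
    # Index 0 is reserved for padding / unknown words (an implementation choice).
    return {w: i + 1 for i, w in enumerate(sorted(words))}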
Step S30: and inputting the regional characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder.
In a specific step S30, the encoder includes a plurality of identical encoding layers, each encoding layer includes a multi-head self-attention sub-layer and a position feedforward sub-layer, and the multi-head self-attention sub-layer includes a plurality of parallel head modules. Step S30 includes:
B1, inputting geometric features of the region features into a matrix of a first parallel head module in a first coding layer according to a preset second attention calculation formula to perform weight calculation, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
B3, repeating the steps B1-B2 to obtain the result values of all parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result into the position feedforward sub-layer for nonlinear transformation, and inputting the output into a second coding layer of the encoder;
B4, repeating the B1-B3 to output the output values of all the coding layers.
In one embodiment, the inputting the geometric feature of the region feature into the matrix of the first parallel head module in the first coding layer for weight calculation, outputting the first result value of the first parallel head module, and includes:
Activating a scaling dot product attention function corresponding to the first parallel head module, mapping the geometric features of the region features to a matrix of the first parallel head module, and performing feature embedding;
and embedding the relation vector between the targets into different sub-modules of the multi-head self-attention sub-layer for fusion by adjusting the weight parameters, and outputting a first result value of a first parallel head module.
In one embodiment, the preset second attention calculation formula includes:
h_i(Q, K, V, η) = Attention(Q, K, V, η) = softmax(η_i) V_i, i ∈ [1, N]
Wherein η is the geometric feature to be fused into the image to be detected, h_i is the i-th parallel head module of the multi-head self-attention sub-layer, Attention is the attention function, and Q, K and V are respectively the query, key and value matrices of the multi-head self-attention sub-layer.
The calculation formula of each h_i includes:
h_i(Q, K, V, η) = Attention(Q, K, V, η) = softmax(η_i) V_i, i ∈ [1, N]
The calculation formula of η in each h_i includes:
Wherein η_G is the geometric relationship between different targets of the image to be detected, and η_ab is the attention weight after the geometric relationship is fused into the image to be detected.
In one embodiment, the preset stitching formula includes:
Wherein the first symbol denotes the initial value of the image to be detected, Concat is a splicing function, h_1, …, h_h are the h parallel head modules of the multi-head self-attention sub-layer, and W_O is a weight value.
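As a hedged sketch only, the following encoding layer fuses the geometric weights η_G into the multi-head self-attention scores before the softmax and then applies the position feedforward sub-layer; the exact fusion formula is not reproduced in the text above, so adding log(η_G) to the scaled dot-product scores is an assumption, as are the dimensions and module names.

import math
import torch
import torch.nn as nn

class GeometrySelfAttentionLayer(nn.Module):
    def __init__(self, dim=512, heads=8, ffn_dim=2048):
        super().__init__()
        self.heads, self.d = heads, dim // heads
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)    # weight value W_O
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, eta_g):
        # x: (B, N, dim) regional features; eta_g: (B, heads, N, N) geometric weights η_G.
        B, N, _ = x.shape
        def split(t):
            return t.view(B, N, self.heads, self.d).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)
        scores = scores + torch.log(eta_g.clamp(min=1e-6))       # fuse η_G (assumed form)
        attn = torch.softmax(scores, dim=-1) @ v                 # softmax(η_i) V_i
        out = self.w_o(attn.transpose(1, 2).reshape(B, N, -1))   # splice parallel head modules
        x = self.norm1(x + out)                                  # Add & Norm
        return self.norm2(x + self.ffn(x))                       # position feedforward sub-layer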
Step S40: and embedding the output value and the category into a decoder input into the preset transformation model for processing, and generating a description text of the image to be detected.
In a specific step S40, the decoder includes a plurality of identical decoding layers, each decoding layer including a masking multi-head self-attention sub-layer, a multi-head cross-attention sub-layer, and a position forward sub-layer, including:
The output value of the last coding layer is subjected to position embedding and used as the input of the covering multi-head self-attention sub-layer, and an input word vector is obtained;
Embedding each output value, each input word vector and each category of the target into the multi-head cross attention sub-layer for cross attention calculation to obtain a weight matrix;
And inputting the weight matrix into the position forward sub-layer for calculation to generate a plurality of keywords, and splicing all the keywords to generate the description text of the image to be detected.
In this embodiment, position encoding is performed on the predicted words of the target categories, and the encoded predicted words are input into the masking multi-head self-attention sub-layer to obtain the word vectors of the weighted sentence; these word vectors form the V vector of the first multi-head cross-attention sub-layer. The output value of the last layer of the encoder is converted into the Q and K vectors through two linear conversion layers, and multi-head attention is then computed with the V vector to obtain a V vector (equal to the input word vector) blended with the similarity information.
The V vector blended with the similarity information is passed through a residual connection layer and normalization and then transmitted to the position forward sub-layer; after another residual connection layer and normalization, its output serves as the input of the next layer of the decoder. Unlike the first layer, the second decoding layer does not perform the masking multi-head self-attention sub-layer operation but directly performs the multi-head self-attention calculation, and its Q, K and V vectors are all results of three linear matrix transformations of the output of the previous decoder layer. After the operations of a total of 6 decoder layers, the output vector passes through one linear layer and one softmax layer over the vocabulary set of the preset transformation model, according to the word information of the real sentence corresponding to each training picture, to obtain the next keyword.
All the keywords are spliced to generate a plurality of output sentences; the beam search method is adopted with the beam size set to 2, the evaluation index score of each output sentence is finally obtained, and the sentence with the highest score is selected as the descriptive text.
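A hedged sketch of the keyword-by-keyword decoding loop with a beam size of 2 is given below; the decoder is abstracted as a decoder_step function returning log-probabilities over the vocabulary, and that interface, the token names and the length limit are assumptions. Note that the sentence score here is the accumulated log-probability, whereas the text above selects the final sentence by an evaluation index score.

import torch

def beam_decode(decoder_step, bos_id, eos_id, max_len=20, beam_size=2):
    # decoder_step(tokens) -> log-probabilities over the vocabulary (assumed interface).
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:
                candidates.append((tokens, score))   # finished sentence is kept as-is
                continue
            log_probs = decoder_step(torch.tensor(tokens))
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]         # keywords of the highest-scoring sentence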
In one embodiment, the embedding each of the output value, the input word vector, and the category of the target into the multi-headed cross attention sub-layer for cross attention calculation includes:
Fusing the output value and the category embedding of the target with each other according to a preset cross attention calculation formula to obtain a blended value;
Carrying out weight calculation on the blended value and the input word vector according to a preset weight calculation formula to obtain a cross attention matrix.
In one embodiment, the preset cross attention calculation formula includes:
Wherein MA is the fused-connection attention module of the multi-head cross-attention sub-layer, α_i is a weight matrix whose size is the same as that of the cross-attention result and which can adjust the contribution degree of the output of each encoder layer, the remaining operand is the blended value, and Y is the input word vector.
In one embodiment, the preset blending calculation formula of the blending value includes:
Wherein the blended value is computed from the output value and L_Att, the category embedding of the target.
In one embodiment, the preset weight calculation formula includes:
Wherein [·] is a merge (concatenation) operation, σ is the sigmoid activation function, W_i ∈ R^{2d×d} is a weight matrix, and b_i is a learnable bias parameter.
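A minimal sketch of this per-layer weighting is shown below, assuming that α_i is obtained by concatenating the cross-attention result of each encoder layer with the input word vector, projecting with W_i ∈ R^{2d×d} plus bias b_i, and applying a sigmoid; summing the weighted results and all names are assumptions.

import torch
import torch.nn as nn

class LayerwiseGate(nn.Module):
    def __init__(self, dim=512, num_layers=6):
        super().__init__()
        # One W_i ∈ R^{2d×d} with bias b_i per encoder layer.
        self.gates = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(num_layers))

    def forward(self, cross_outputs, word_vectors):
        # cross_outputs: list of (B, T, dim) cross-attention results, one per encoder layer;
        # word_vectors: (B, T, dim) input word vectors Y.
        fused = torch.zeros_like(word_vectors)
        for gate, c in zip(self.gates, cross_outputs):
            alpha = torch.sigmoid(gate(torch.cat([c, word_vectors], dim=-1)))  # α_i = σ(W_i[c; Y] + b_i)
            fused = fused + alpha * c   # each layer's contribution scaled by its weight matrix
        return fused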
Since the whole sequence is input at once in the encoder, all input information is available when the attention is calculated; in the decoder, however, to ensure that at each moment only the sequence information output before the current moment can be seen, the masking multi-head self-attention sub-layer is introduced, and the input word vector is the result of the input information passing through the masking multi-head self-attention sub-layer.
Each multi-head cross-attention sub-layer is followed by the regularization of an Add & Norm layer, the position forward sub-layer (FFN layer) and another Add & Norm layer, which mainly convert the input of each multi-head cross-attention sub-layer into an input with the same mean and variance so as to speed up convergence; the calculation takes the standard form LayerNorm(x + Sublayer(x)).
Finally, the next output keyword is determined from the output features of the last decoding layer, and the dimension of the output keyword features is the same as that of the vocabulary set.
Referring to fig. 2, a functional block diagram of an image description generating apparatus 100 according to the present invention is shown.
The image description generation apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the image description generating apparatus 100 may include an identification module 110, a calculation module 120, an output module 130, and a generation module 140. A module of the present invention may also be referred to as a unit, meaning a series of computer program segments that can be executed by the processor of the electronic device, perform fixed functions, and are stored in the memory of the electronic device.
The functions of the respective modules/units are as follows in this embodiment:
The identification module 110: the method comprises the steps of inputting an image to be detected into a preset target detection model for identification, and outputting regional characteristics of the image to be detected;
The calculation module 120: used for inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected;
the output module 130: the encoder is used for inputting the regional characteristics into a preset transformation model for processing and outputting the output value of the encoder;
The generating module 140: used for inputting the output value and the category embedding into a decoder of the preset transformation model for processing, and generating the description text of the image to be detected.
In one embodiment, the inputting the image to be detected into a preset target detection model for identification, and outputting the region characteristics of the image to be detected includes:
According to a preset geometric relation calculation formula, carrying out frame recognition on targets contained in the image to be detected to obtain frames of the targets and target categories of each frame;
And adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
In one embodiment, the preset geometrical relationship calculation formula includes:
Wherein ξ(a, b) is the regional characteristic of the image to be detected, (x_a, y_a) is the center point coordinate of the a-th frame of the image to be detected, (x_b, y_b) is the center point coordinate of the b-th frame of the image to be detected, (w_a, h_a) is the width and height of the a-th frame, and (w_b, h_b) is the width and height of the b-th frame.
In one embodiment, the inputting the region feature into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected includes:
Matching the target category of the image to be detected with a preset word of a preset multidimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category;
And carrying out coding embedding on the predicted word according to a preset first attention formula to obtain category embedding of the image to be detected.
In one embodiment, the preset tag attention model includes a plurality of attention modules, each of the attention modules includes an independent scaled dot-product attention function, and the encoding and embedding of the predicted word according to the preset first attention formula to obtain the category embedding of the image to be detected includes:
a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module;
a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module;
A3, repeating the steps A1-A2 to obtain weight values of all attention modules, splicing all weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
In one embodiment, the encoder includes a plurality of identical encoding layers, each encoding layer includes a multi-head self-attention sub-layer and a position feedforward sub-layer, the multi-head self-attention sub-layer includes a plurality of parallel head modules, and the inputting of the regional characteristics into the encoder of the preset transformation model for processing and outputting the output value of the encoder includes:
B1, inputting geometric features of the region features into a matrix of a first parallel head module in a first coding layer according to a preset second attention calculation formula to perform weight calculation, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
B3, repeating the steps B1-B2 to obtain the result values of all parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result into the position feedforward sub-layer for nonlinear transformation, and inputting the output into a second coding layer of the encoder;
B4, repeating the B1-B3 to output the output values of all the coding layers.
In one embodiment, the decoder includes a plurality of identical decoding layers, each decoding layer includes a masking multi-head self-attention sub-layer, a multi-head cross-attention sub-layer and a position forward sub-layer, and the inputting of the output value and the category embedding into the decoder of the preset transformation model for processing to generate the descriptive text of the image to be detected includes:
Performing position embedding on the output value of the last coding layer and using it as the input of the masking multi-head self-attention sub-layer to obtain an input word vector;
Inputting each output value, the input word vector and the category embedding of the target into the multi-head cross-attention sub-layer for cross-attention calculation to obtain a weight matrix;
And inputting the weight matrix into the position forward sub-layer for calculation to generate a plurality of keywords, and splicing all the keywords to generate the description text of the image to be detected.
Referring to fig. 3, a schematic diagram of a preferred embodiment of an electronic device 1 according to the present invention is shown.
The electronic device 1 includes, but is not limited to: memory 11, processor 12, display 13, and network interface 14. The electronic device 1 is connected to a network through the network interface 14 to obtain original data. The network may be a wireless or wired network such as an Intranet, the Internet, the Global System for Mobile communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a call network.
The memory 11 includes at least one type of readable medium including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or a memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit of the electronic device 1 and an external storage device. In this embodiment, the memory 11 is typically used for storing an operating system and various types of application software installed in the electronic device 1, such as the program code of the image description generation program 10. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may in some embodiments be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is typically used for controlling the overall operation of the electronic device 1, e.g. performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to execute the program code stored in the memory 11 or process data, such as executing the program code of the image description generation program 10.
The display 13 may be referred to as a display screen or a display unit. In some embodiments the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch display, or the like. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface, for example displaying the results of data statistics.
The network interface 14 may alternatively comprise a standard wired interface, a wireless interface, such as a WI-FI interface, which network interface 14 is typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
Fig. 3 shows only the electronic device 1 with components 11-14 and the image description generation program 10, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further comprise a user interface, which may comprise a Display, an input unit such as a Keyboard, and a standard wired interface and wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch display, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
The electronic device 1 may also include Radio Frequency (RF) circuits, sensors, audio circuits and the like, which are not described in detail herein.
In the above embodiment, the processor 12 may implement the following steps when executing the image description generation program 10 stored in the memory 11:
Inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
Inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected;
Inputting the regional characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
And inputting the output value and the category embedding into a decoder of the preset transformation model for processing, and generating a description text of the image to be detected.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For a detailed description of the above steps, please refer to the functional block diagram of the embodiment of the image description generating apparatus 100 described above with reference to fig. 2 and the description of the flowchart of the embodiment of the image description generating method described above with reference to fig. 1.
Furthermore, the embodiment of the invention also provides a computer readable medium, which may be nonvolatile or volatile. The computer readable medium may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, and the like. The computer readable medium includes a storage data area and a storage program area: the storage data area stores data created according to the use of blockchain nodes, and the storage program area stores an image description generation program 10 which, when executed by a processor, implements the following operations:
Inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
Inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected;
Inputting the regional characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
And inputting the output value and the category embedding into a decoder of the preset transformation model for processing, and generating a description text of the image to be detected.
The embodiment of the computer readable medium of the present invention is substantially the same as the embodiment of the image description generating method described above, and will not be described herein.
In another embodiment, in the image description generating method provided by the present invention, in order to further ensure the privacy and security of all the data that appear, all the data may also be stored in a node of a blockchain. Such as region characteristics, category embedding, which may all be stored in the blockchain node.
It should be noted that, the blockchain referred to in the present invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm, etc. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It should be noted that, the foregoing reference numerals of the embodiments of the present invention are merely for describing the embodiments, and do not represent the advantages and disadvantages of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a medium as described above (e.g. ROM/RAM, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, an electronic device, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (8)

1. An image description generation method, characterized in that the method comprises:
Inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
Inputting the regional characteristics into a preset tag attention model for weight calculation, and outputting the category embedding of the image to be detected;
Inputting the regional characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
inputting the output value and the category embedding into a decoder of the preset transformation model for processing, and generating a description text of the image to be detected;
The step of inputting the region characteristics into a preset tag attention model for weight calculation and outputting the category embedding of the image to be detected comprises the following steps: matching the target category of the image to be detected with a preset word of a preset multidimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category; encoding and embedding the predicted word according to a preset first attention formula to obtain category embedding of the image to be detected;
The preset tag attention model comprises a plurality of attention modules, each attention module comprises an independent scaling dot product attention function, the predicted word is encoded and embedded according to a preset first attention formula, and the class embedding of the image to be detected is obtained, and the method comprises the following steps: a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module; a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module; a3, repeating the steps A1-A2 to obtain weight values of all attention modules, splicing all weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
2. The method for generating an image description according to claim 1, wherein inputting the image to be measured into a preset target detection model for recognition, outputting the region characteristics of the image to be measured, comprises:
According to a preset geometric relation calculation formula, carrying out frame recognition on targets contained in the image to be detected to obtain frames of the targets and target categories of each frame;
And adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
3. The image description generation method according to claim 2, wherein the preset geometric relationship calculation formula includes:
wherein ξ(a, b) is the regional characteristic of the image to be detected, (x_a, y_a) is the center point coordinate of the a-th frame of the image to be detected, (x_b, y_b) is the center point coordinate of the b-th frame of the image to be detected, (w_a, h_a) is the width and height of the a-th frame, and (w_b, h_b) is the width and height of the b-th frame.
4. The image description generation method according to claim 1, wherein the encoder includes a plurality of identical encoding layers, each encoding layer including a multi-head self-attention sub-layer and a position feedforward sub-layer, the multi-head self-attention sub-layer including a plurality of parallel head modules, the encoder inputting the region characteristics into a preset transformation model to process, and outputting output values of the encoder, comprising:
B1, inputting geometric features of the region features into a matrix of a first parallel head module in a first coding layer according to a preset second attention calculation formula to perform weight calculation, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
B3, repeating the steps B1-B2 to obtain the result values of all parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result into the position feedforward sub-layer for nonlinear transformation, and inputting the output into a second coding layer of the encoder;
B4, repeating the B1-B3 to output the output values of all the coding layers.
5. The image description generation method according to claim 1, wherein the decoder includes a plurality of identical decoding layers, each decoding layer includes a masking multi-head self-attention sub-layer, a multi-head cross-attention sub-layer and a position forward sub-layer, and the inputting of the output value and the category embedding into the decoder of the preset transformation model for processing to generate the description text of the image to be detected comprises:
Performing position embedding on the output value of the last coding layer and using it as the input of the masking multi-head self-attention sub-layer to obtain an input word vector;
Inputting each output value, the input word vector and the category embedding of the target into the multi-head cross-attention sub-layer for cross-attention calculation to obtain a weight matrix;
And inputting the weight matrix into the position forward sub-layer for calculation to generate a plurality of keywords, and splicing all the keywords to generate the description text of the image to be detected.
6. An image description generation apparatus for implementing the image description generation method according to any one of claims 1 to 5, characterized in that the apparatus comprises:
an identification module, configured to input the image to be detected into a preset target detection model for recognition and output the regional characteristics of the image to be detected;
a calculation module, configured to input the regional characteristics into a preset tag attention model for weight calculation and output the category embedding of the image to be detected;
an output module, configured to input the regional characteristics into the encoder of a preset transformation model for processing and output the output values of the encoder; and
a generation module, configured to input the output values and the category embedding into the decoder of the preset transformation model for processing and generate the description text of the image to be detected.
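
For readability, a minimal sketch of how the four modules of the apparatus in claim 6 could be composed into one pipeline; the callables passed in stand for the target detection model, the tag attention model, and the encoder and decoder of the transformation model, and are assumptions rather than concrete implementations.

```python
# Hedged sketch of the apparatus in claim 6 as a simple pipeline of the four modules.
class ImageDescriptionGenerator:
    def __init__(self, detector, tag_attention, encoder, decoder):
        self.identify = detector        # identification module: image -> regional characteristics
        self.calculate = tag_attention  # calculation module: regional characteristics -> category embedding
        self.encode = encoder           # output module: regional characteristics -> encoder output values
        self.generate = decoder         # generation module: (output values, category embedding) -> text

    def describe(self, image):
        regions = self.identify(image)
        categories = self.calculate(regions)
        encoded = self.encode(regions)
        return self.generate(encoded, categories)
```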
7. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a program executable by the at least one processor to enable the at least one processor to perform the image description generation method according to any one of claims 1 to 5.
8. A computer-readable medium storing an image description generation program which, when executed by a processor, implements the image description generation method according to any one of claims 1 to 5.
CN202210423256.8A 2022-04-21 2022-04-21 Image description generation method, device, equipment and medium Active CN114743018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210423256.8A CN114743018B (en) 2022-04-21 2022-04-21 Image description generation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210423256.8A CN114743018B (en) 2022-04-21 2022-04-21 Image description generation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114743018A CN114743018A (en) 2022-07-12
CN114743018B (en) 2024-05-31

Family

ID=82284146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210423256.8A Active CN114743018B (en) 2022-04-21 2022-04-21 Image description generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114743018B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821271B (en) * 2022-05-19 2022-09-16 平安科技(深圳)有限公司 Model training method, image description generation device and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019042244A1 (en) * 2017-08-30 2019-03-07 腾讯科技(深圳)有限公司 Image description generation method, model training method and device, and storage medium
CN109543699A (en) * 2018-11-28 2019-03-29 北方工业大学 Image abstract generation method based on target detection
CN109753981A (en) * 2017-11-06 2019-05-14 彼乐智慧科技(北京)有限公司 A kind of method and device of image recognition
CN110472688A (en) * 2019-08-16 2019-11-19 北京金山数字娱乐科技有限公司 The method and device of iamge description, the training method of image description model and device
CN111639594A (en) * 2020-05-29 2020-09-08 苏州遐迩信息技术有限公司 Training method and device of image description model
KR102225024B1 (en) * 2019-10-24 2021-03-08 연세대학교 산학협력단 Apparatus and method for image inpainting
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CA3068891A1 (en) * 2020-01-17 2021-07-17 Element Ai Inc. Method and system for generating a vector representation of an image
CN113222916A (en) * 2021-04-28 2021-08-06 北京百度网讯科技有限公司 Method, apparatus, device and medium for detecting image using target detection model
CN113591967A (en) * 2021-07-27 2021-11-02 南京旭锐软件科技有限公司 Image processing method, device and equipment and computer storage medium
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
CN113946706A (en) * 2021-05-20 2022-01-18 广西师范大学 Image description generation method based on reference preposition description
CN114266905A (en) * 2022-01-11 2022-04-01 重庆师范大学 Image description generation model method and device based on Transformer structure and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10755099B2 (en) * 2018-11-13 2020-08-25 Adobe Inc. Object detection in images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An image description method based on attention mechanism and multi-modality; Niu Bin et al.; Journal of Liaoning University (Natural Science Edition); 2019-02-15 (Issue 01); pp. 38-45 *

Also Published As

Publication number Publication date
CN114743018A (en) 2022-07-12

Similar Documents

Publication Publication Date Title
CN111695439B (en) Image structured data extraction method, electronic device and storage medium
CN111241304B (en) Answer generation method based on deep learning, electronic device and readable storage medium
CN112396613B (en) Image segmentation method, device, computer equipment and storage medium
CN110334179B (en) Question-answer processing method, device, computer equipment and storage medium
CN112395979B (en) Image-based health state identification method, device, equipment and storage medium
CN112308237B (en) Question-answer data enhancement method and device, computer equipment and storage medium
CN110852106A (en) Named entity processing method and device based on artificial intelligence and electronic equipment
CN111898550B (en) Expression recognition model building method and device, computer equipment and storage medium
CN110276382B (en) Crowd classification method, device and medium based on spectral clustering
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN114580424B (en) Labeling method and device for named entity identification of legal document
US20230334893A1 (en) Method for optimizing human body posture recognition model, device and computer-readable storage medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN113656547A (en) Text matching method, device, equipment and storage medium
CN116311256A (en) Image processing method and device, and training method and device for recognition network
CN114743018B (en) Image description generation method, device, equipment and medium
CN111291695B (en) Training method and recognition method for recognition model of personnel illegal behaviors and computer equipment
CN114764869A (en) Multi-object detection with single detection per object
Li et al. Style transfer for QR code
CN113821663A (en) Image processing method, device, equipment and computer readable storage medium
CN113722437B (en) User tag identification method, device, equipment and medium based on artificial intelligence
CN114627170B (en) Three-dimensional point cloud registration method, three-dimensional point cloud registration device, computer equipment and storage medium
CN113095435A (en) Video description generation method, device, equipment and computer readable storage medium
CN115082727B (en) Scene classification method and system based on multi-layer local perception depth dictionary learning
CN112966085B (en) Man-machine conversation intelligent control method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant