CN114519395B - Model training method and apparatus, text abstract generation method and apparatus, and device

Info

Publication number: CN114519395B
Application number: CN202210160816.5A
Authority: CN (China)
Inventors: Shu Chang (舒畅), Chen Youxin (陈又新)
Assignee: Ping An Technology Shenzhen Co Ltd
Other versions: CN114519395A (application publication)
Related application: PCT/CN2022/090729 (WO2023159763A1)
Legal status: Active (granted)

Classifications

    • G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; extraction of features in feature space; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F16/5846 — Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content, using extracted text


Abstract

The embodiments relate to the technical field of artificial intelligence, and in particular to a model training method and apparatus, a text abstract generation method and apparatus, and a device. The model training method includes: performing multimodal encoding on original image data to obtain an original image vector, and performing multimodal encoding on original text data to obtain an original text vector; obtaining original abstract data according to the original text data and the original image data; vectorizing the original abstract data to obtain an original abstract vector; constructing a first positive example pair according to the original abstract vector and the corresponding original text vector; constructing a second positive example pair according to the original text vector and the corresponding original image vector; and performing contrastive learning training on an original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain a text abstract generation model. The technical solution of the embodiments of the application can improve the accuracy of the text abstract generated by the text abstract generation model.

Description

Model training method and apparatus, text abstract generation method and apparatus, and device
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a model training method and apparatus, a text abstract generation method and apparatus, and a device.
Background
With the increasing number of video sharing platforms, people can view images, videos and text anytime and anywhere. A multimodal abstract aims to condense information from multiple modalities into a short, concise and readable text abstract so that a user can quickly and easily grasp the subject information in an image or video.
In the related art, when a multimodal text abstract is generated, the features of the text and the images are considered only in isolation, so the generated multimodal text abstract is inaccurate.
Disclosure of Invention
The main object of the embodiments of the present application is to provide a model training method and apparatus, a text abstract generation method and apparatus, and a device, which can improve the accuracy of the text abstract generated by the text abstract generation model.
To achieve the above object, a first aspect of the embodiments of the present application provides a training method for training a text abstract generation model, including:
acquiring at least two items of original training data, where the original training data include original image data and original text data, and the original image data and the original text data are in one-to-one correspondence;
performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector;
obtaining original abstract data according to the original text data and the original image data;
vectorizing the original abstract data to obtain an original abstract vector;
constructing a first positive example pair according to the original abstract vector and the original text vector;
constructing a second positive example pair according to the original text vector and the original image vector;
and performing contrastive learning training on an original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model.
In some embodiments, performing multimodal encoding on the original image data to obtain the original image vector and performing multimodal encoding on the original text data to obtain the original text vector includes:
performing cross-modal encoding on the original text data with a preset cross-modal encoder to obtain an original text matrix;
performing pooling and mapping processing on the original text matrix to obtain the original text vector;
performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix;
and performing pooling and mapping processing on the original image matrix to obtain the original image vector.
In some embodiments, obtaining the original abstract data according to the original text data and the original image data includes:
splicing the original text matrix and the original image matrix to obtain a target abstract matrix;
and decoding the target abstract matrix with a preset decoder to obtain the original abstract data.
In some embodiments, performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain the original image matrix includes:
pre-encoding the original text data to obtain a text sub-vector matrix;
pre-encoding the original image data to obtain an image sub-vector matrix;
taking the transpose of the image sub-vector matrix to obtain an image sub-vector transpose matrix;
and performing iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix and the image sub-vector matrix to obtain the original image matrix.
In some embodiments, performing contrastive learning training on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model includes:
constructing a first loss function according to the original abstract data, the first positive example pairs and the corresponding first negative example pairs;
constructing a second loss function according to the original abstract data, the second positive example pairs and the corresponding second negative example pairs;
obtaining a target loss function according to the first loss function and the second loss function;
and fine-tuning the parameters of the original abstract generation model according to the target loss function to obtain the text abstract generation model.
To achieve the above object, a second aspect of the embodiments of the present application provides a text abstract generation method, including:
acquiring to-be-generated text data and to-be-generated image data;
and inputting the to-be-generated text data and the to-be-generated image data into a text abstract generation model to generate a target text abstract, where the text abstract generation model is trained by the training method of any one of the embodiments of the first aspect.
To achieve the above object, a third aspect of the embodiments of the present application provides a training device for training a text abstract generation model, the training device including:
a first acquisition module, configured to acquire at least two items of original training data, where the original training data include original image data and original text data, and the original image data and the original text data are in one-to-one correspondence;
an encoding module, configured to perform multimodal encoding on the original image data to obtain an original image vector, and perform multimodal encoding on the original text data to obtain an original text vector;
a first processing module, configured to obtain original abstract data according to the original text data and the original image data;
a second processing module, configured to vectorize the original abstract data to obtain an original abstract vector;
a first construction module, configured to construct a first positive example pair according to the original abstract vector and the original text vector;
a second construction module, configured to construct a second positive example pair according to the original text vector and the original image vector;
and a training module, configured to perform contrastive learning training on an original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model.
To achieve the above object, a fourth aspect of the embodiments of the present application provides a text abstract generation device, including:
a second acquisition module, configured to acquire to-be-generated text data and to-be-generated image data;
and an abstract generation module, configured to input the to-be-generated text data and the to-be-generated image data into a text abstract generation model to generate a target text abstract, where the text abstract generation model is trained by the training method of any one of the embodiments of the first aspect.
To achieve the above object, a fifth aspect of the embodiments of the present application provides an electronic device, including:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement:
the training method of any one of the embodiments of the first aspect; or
the text abstract generation method of the embodiment of the second aspect.
To achieve the above object, a sixth aspect of the embodiments of the present application provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute:
the training method of any one of the embodiments of the first aspect; or
the text abstract generation method of the embodiment of the second aspect.
According to the model training method and apparatus, the text abstract generation method and apparatus, and the device of the embodiments of the application, original training data are first acquired. Multimodal encoding is then performed on the original image data in the original training data to obtain an original image vector, and on the original text data to obtain an original text vector. Original abstract data are obtained according to the original text data and the original image data, and the original abstract data are vectorized to obtain an original abstract vector. A first positive example pair is constructed from the original abstract vector and its corresponding original text vector, and a second positive example pair is constructed from the original text vector and its corresponding original image vector. Finally, contrastive learning training is performed on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model. With this arrangement, the obtained text abstract generation model not only has the capability of generating text abstracts, but also has an enhanced capability of semantically representing the multimodal text and image data. Because the original abstract generation model undergoes contrastive learning training on the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs, the text abstract generation model can fully consider the relation between the text and the images and the relation between the text and the target abstract when generating the target abstract, which improves the accuracy of the text abstract generated by the text abstract generation model.
Drawings
FIG. 1 is a flow chart of a training method for a model provided by an embodiment of the present application;
FIG. 2 is a flowchart of a specific method of step S200 in FIG. 1;
FIG. 3 is a flowchart of a specific method of step S300 in FIG. 1;
FIG. 4 is a flowchart of a specific method of step S230 in FIG. 2;
FIG. 5 is a flowchart of a specific method of step S700 in FIG. 1;
FIG. 6 is a flowchart of a text summary generation method provided by an embodiment of the present application;
FIG. 7 is a block diagram of a training device for a model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
It should be noted that although a division into functional modules is shown in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a different module division than in the device diagrams, or in a different order than in the flowcharts. The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular sequence or chronological order.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which this application belongs. The terminology used herein is only for the purpose of describing embodiments of the application and is not intended to limit the application.
First, several terms used in the present application are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
Natural language processing (NLP): NLP is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics, which processes, understands and applies human languages (such as Chinese and English). Natural language processing includes syntactic parsing, semantic analysis, discourse understanding, and the like. It is commonly used in machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to computational linguistics.
Encoder: converts an input sequence into a fixed-length vector.
Cross-modal encoder (Cross-Modal Encoder): cross-modal encoding refers to interactive encoding between input sequences of different modalities, for example interactive encoding between language and images; cross-modal encoding between language and images is realized by a language encoder and an image encoder.
Decoder: converts the previously generated fixed-length vector back into an output sequence. The input sequence may be words, speech, images or video; the output sequence may be text or images.
Mean pooling (MeanPooling): mean pooling refers to averaging all values within a local receptive field.
Multimodal abstractive summarization (MAS): a multimodal abstract aims to condense information from multiple modalities (such as images, text and audio) into a short, concise and readable text abstract.
With the increasing number of video sharing platforms, users can view images, videos and text anytime and anywhere. However, the information contained in images, videos and text is vast and mixed, and it is difficult for users to grasp its key points directly. The images, videos and text therefore need to be analyzed so that the important information in them can be extracted and presented to the user in text form, and the multimodal abstract meets exactly this need.
The tasks of a multimodal abstract include identifying the subject and generating words based on an understanding of the input. In the related art, language and visual features are combined, and an abstract is generated by a hierarchical attention method based on Seq2Seq. Seq2Seq is a very important and popular model in current natural language processing: an encoder (Encoder) vectorizes the input information and integrates its semantics, a decoder (Decoder) decodes the intermediate vectors to obtain the text output, and supervised training data are used to train the parameters in the encoder and decoder until the target model is fitted. This method does not consider the problem of semantic alignment between data of different modalities, and direct information fusion easily accumulates redundant information. When generating the multimodal text abstract, this approach considers the features of the text and the images only in isolation, without considering the consistency between the text and the images, so the generated multimodal text abstract is inaccurate.
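As an illustration only, the Seq2Seq pattern described above can be sketched as follows (a minimal GRU-based encoder-decoder with teacher forcing; all names and sizes are illustrative and do not reproduce the related art's hierarchical attention):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: encode the input into vectors, decode text."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor):
        _, h = self.encoder(self.embed(src_ids))           # integrate input semantics
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)  # teacher forcing
        return self.out(dec_out)                           # per-step word logits

model = Seq2Seq(vocab_size=30000)
logits = model(torch.randint(0, 30000, (2, 16)), torch.randint(0, 30000, (2, 8)))
```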
Based on the above, the embodiments of the present application provide a model training method and apparatus, a text abstract generation method and apparatus, and a device, which can improve the accuracy of the abstract generated by the model.
The embodiments of the present application provide a model training method and apparatus, a text abstract generation method and apparatus, and a device; the model training method in the embodiments of the disclosure is described first through the following embodiments.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly covers computer vision, robotics, biometric recognition, speech processing, natural language processing and machine learning/deep learning.
The embodiments of the present application provide a model training method and a text abstract generation method, which relate to the technical field of artificial intelligence and in particular to the technical field of data mining. The model training method or the text abstract generation method provided by the embodiments of the application may be applied to a terminal, applied to a server, or run as software in a terminal or a server. In some embodiments, the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer or a smart watch; the server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms; the software may be an application implementing the model training method or the text abstract generation method, but is not limited to the above forms.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Embodiments of the present application are further described below with reference to the accompanying drawings.
Referring to FIG. 1, a first aspect of some embodiments of the present application provides a training method for training a text abstract generation model. The training method includes, but is not limited to, steps S100 to S700, which are described in detail below with reference to FIG. 1.
Step S100: acquiring at least two items of original training data, where the original training data include original image data and original text data, and the original image data and the original text data are in one-to-one correspondence.
In step S100 of some embodiments, the original training data form a training set; each item of original training data includes original image data and original text data in one-to-one correspondence. The original image data is denoted by I and the original text data by T; the embodiments of the present application do not specifically limit the contents of the original image data and the original text data.
Step S200: performing multimodal encoding on the original image data to obtain an original image vector, and performing multimodal encoding on the original text data to obtain an original text vector.
Step S300: obtaining original abstract data according to the original text data and the original image data.
Step S400: vectorizing the original abstract data to obtain an original abstract vector.
In step S400 of some embodiments, the original abstract data is denoted by Y. The original abstract data Y is encoded by the cross-modal encoder to obtain an original abstract matrix, and the original abstract matrix is then subjected to mean pooling and a mapping operation to obtain the feature vector corresponding to the original abstract data, namely the original abstract vector, denoted by $\bar{h}_Y$.
Step S500: constructing a first positive example pair according to the original abstract vector and the original text vector.
In step S500 of some embodiments, when the model is trained for input-output text consistency, the input training set is one batch containing M samples. To ensure that the input original text data and the original abstract data are consistent in subject, a first positive example pair is constructed from an original abstract vector and its corresponding original text vector, while the other original text data and original abstract data in the batch can be regarded as the first negative example pairs of this first positive example pair. For example, a batch may be expressed as $B = \{D_1, D_2, \ldots, D_M\}$, where each $D_i$ in $B$ includes original text data $T_i$ and corresponding original abstract data $Y_i$. Within the same batch, the original text data $T_i$ and corresponding original abstract data $Y_i$ of one sample can be regarded as a first positive example pair, and the original text data and original abstract data of the other samples in the batch form the corresponding first negative example pairs. With this arrangement, the model can learn the subject consistency of the original text data and the original abstract data, thereby achieving the training objective of input-output text consistency.
Step S600: constructing a second positive example pair according to the original text vector and the original image vector.
In step S600 of some embodiments, similarly to step S500, when the model is trained for input text-image consistency, the input training set is one batch containing M samples. To ensure that the input original text data and the original image data are consistent in subject, a second positive example pair is constructed from an original text vector and its corresponding original image vector, while the other original image data and original text data in the batch can be regarded as the second negative example pairs of this second positive example pair. For example, a batch may be expressed as $B = \{D_1, D_2, \ldots, D_M\}$, where each $D_i$ in $B$ includes original text data $T_i$ and corresponding original image data $I_i$. Within the same batch, the original text data $T_i$ and corresponding original image data $I_i$ of one sample can be regarded as a second positive example pair, and the original text data and original image data of the other samples in the batch form the corresponding second negative example pairs. With this arrangement, the model can learn the subject consistency of the original text data and the original image data, thereby achieving the training objective of input text-image consistency.
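The in-batch pairing of steps S500 and S600 can be illustrated with the following sketch (the batch size and string placeholders are hypothetical; they only show which combinations act as positive and negative example pairs):

```python
# Within a batch B = {D_1, ..., D_M}, sample i contributes the first positive
# pair (T_i, Y_i) and the second positive pair (T_i, I_i); combinations across
# different samples of the same batch act as the negative example pairs.
M = 4  # hypothetical batch size
first_positive_pairs = [(f"T{i}", f"Y{i}") for i in range(M)]
second_positive_pairs = [(f"T{i}", f"I{i}") for i in range(M)]
first_negative_pairs = [(f"T{i}", f"Y{j}")
                        for i in range(M) for j in range(M) if j != i]
second_negative_pairs = [(f"T{i}", f"I{j}")
                         for i in range(M) for j in range(M) if j != i]
```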
Step S700: performing contrastive learning training on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain a text abstract generation model.
In step S700 of some embodiments, the target abstract is a text abstract, and the original abstract generation model can be represented by formula (1):

$$p_\theta(Y \mid D) = \prod_{j=1}^{|Y|} p_\theta(y_j \mid y_{<j}, D) \qquad (1)$$

In formula (1), $D$ represents a sample in the training process and $y_j$ represents the j-th word of the target abstract. In the embodiments of the application, the target abstract is composed of individual words, and the target abstract is obtained by generating a plurality of words one after another. The embodiments of the application perform contrastive learning training on the parameters $\theta$ of the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs, so as to obtain a model that simultaneously represents the consistency between the input text and the input image and the consistency between the input text and the output abstract. The model can therefore take both the image data and the text data into account when generating the target text abstract, which improves the accuracy of the text abstract generated by the text abstract generation model.
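Formula (1) can be evaluated in log space under teacher forcing, as in the sketch below (the decoder that produces `logits` is assumed; only the summation of per-word log-probabilities is shown):

```python
import torch
import torch.nn.functional as F

def abstract_log_likelihood(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) decoder outputs; target_ids: (seq_len,)
    log_probs = F.log_softmax(logits, dim=-1)
    per_word = log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    return per_word.sum()  # log p(Y | D) = sum_j log p(y_j | y_<j, D)

logits = torch.randn(10, 30000)              # hypothetical decoder outputs
target_ids = torch.randint(0, 30000, (10,))  # word ids of the target abstract
print(abstract_log_likelihood(logits, target_ids))
```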
According to the model training method described above, original training data are first acquired. Multimodal encoding is then performed on the original image data in the original training data to obtain an original image vector, and on the original text data to obtain an original text vector. Original abstract data are obtained according to the original text data and the original image data, and the original abstract data are vectorized to obtain an original abstract vector. A first positive example pair is constructed from the original abstract vector and its corresponding original text vector, and a second positive example pair is constructed from the original text vector and its corresponding original image vector. Finally, contrastive learning training is performed on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model. With this arrangement, the obtained text abstract generation model not only has the capability of generating text abstracts, but also has an enhanced capability of semantically representing the multimodal text and image data. Because the original abstract generation model undergoes contrastive learning training on the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs, the text abstract generation model can fully consider the relation between the text and the images and the relation between the text and the target abstract when generating the target abstract, which improves the accuracy of the text abstract generated by the text abstract generation model.
Referring to FIG. 2, in some embodiments of the present application, step S200 includes, but is not limited to, steps S210 to S240, which are described in detail below with reference to FIG. 2.
Step S210: performing cross-modal encoding on the original text data with the preset cross-modal encoder to obtain an original text matrix.
In step S210 of some embodiments, the original text data is cross-modally encoded by the cross-modal encoder to obtain the original text matrix, denoted by $H_T$, where $H_T \in R^{L \times D}$, $L$ represents the length of the original text data and $D$ represents the dimension of the mapping vector.
Step S220: performing pooling and mapping processing on the original text matrix to obtain the original text vector.
In step S220 of some embodiments, the original text matrix obtained in step S210 is subjected to mean pooling and then to a mapping operation to obtain the original text vector, denoted by $\bar{h}_T$.
Step S230: performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix.
Step S240: performing pooling and mapping processing on the original image matrix to obtain the original image vector.
In steps S230 to S240, similarly to the preceding steps, the obtained original image matrix is denoted by $H_I$, where $H_I \in R^{L \times D}$, and the original image vector is denoted by $\bar{h}_I$.
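The pooling-and-mapping of steps S220 and S240 can be sketched as follows (the linear projection is an assumption, since the embodiments specify only mean pooling followed by a mapping operation):

```python
import torch
import torch.nn as nn

L, D = 128, 512              # illustrative text length and vector dimension
proj = nn.Linear(D, D)       # assumed mapping layer

def pool_and_map(H: torch.Tensor) -> torch.Tensor:
    pooled = H.mean(dim=0)   # mean pooling over the length dimension
    return proj(pooled)      # mapping to the final feature vector

H_T = torch.randn(L, D)      # original text matrix from the encoder
H_I = torch.randn(L, D)      # original image matrix from the encoder
h_T = pool_and_map(H_T)      # original text vector
h_I = pool_and_map(H_I)      # original image vector
```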
Referring to fig. 3, in some embodiments of the present application, step S300 includes, but is not limited to, step S310 and step S320, which are described in detail below in conjunction with fig. 3.
Step S310: splicing the original text matrix and the original image matrix to obtain a target abstract matrix.
Step S320: decoding the target abstract matrix with the preset decoder to obtain the original abstract data.
In steps S310 to S320 of some embodiments, the original text matrix $H_T$ and the original image matrix $H_I$ are spliced to obtain the target abstract matrix, denoted by $H$. The splicing operation combines the features of the original text data and the original image data so that the subsequent original abstract generation model can learn the features of both. The target abstract matrix $H$ is then decoded by the decoder to obtain the original abstract data $Y$. However, the original abstract data obtained in this way does not yet reflect the subject consistency between the original text data and the original image data, or between the original text data and the original abstract data. Therefore, contrastive learning training must be performed on the original abstract data and on the first and second positive example pairs constructed subsequently, so that the text abstract generation model learns the subject consistency between the original text data and the original image data and between the original text data and the original abstract data.
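The splicing and decoding of steps S310 and S320 can be sketched as follows (row-wise concatenation and a Transformer decoder are assumptions; the embodiments specify only a splicing operation and a preset decoder):

```python
import torch
import torch.nn as nn

L, D = 128, 512
H_T = torch.randn(L, 1, D)         # original text matrix (length, batch, dim)
H_I = torch.randn(L, 1, D)         # original image matrix
H = torch.cat([H_T, H_I], dim=0)   # target abstract matrix, (2L, 1, D)

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D, nhead=8), num_layers=2)
tgt = torch.randn(32, 1, D)        # embeddings of the abstract decoded so far
states = decoder(tgt, H)           # decode against the spliced matrix
word_logits = nn.Linear(D, 30000)(states)  # project to vocabulary logits
```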
Referring to fig. 4, in some embodiments of the present application, step S230 includes, but is not limited to, step S231, step S232, step S233, and step S234, which are described in detail below with reference to fig. 4.
Step S231: pre-encoding the original text data to obtain a text sub-vector matrix.
Step S232: pre-encoding the original image data to obtain an image sub-vector matrix.
Step S233: taking the transpose of the image sub-vector matrix to obtain an image sub-vector transpose matrix.
Step S234: performing iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix and the image sub-vector matrix to obtain the original image matrix.
Specifically, in this embodiment, assuming that the original text data includes L words, the original text data is first pre-encoded to obtain a text sub-vector matrix $H_{T1}$, the original image data is pre-encoded to obtain an image sub-vector matrix $H_{I1}$, and the transpose of the image sub-vector matrix is taken to obtain the image sub-vector transpose matrix $H_{I1}^\top$. Iterative processing is then performed on the text sub-vector matrix, the image sub-vector transpose matrix and the image sub-vector matrix to obtain the original image matrix. Specifically, this can be expressed by formula (2):

$$H_I = \mathrm{softmax}\!\left(\frac{H_{T1} H_{I1}^\top}{\sqrt{D}}\right) H_{I1} \qquad (2)$$

In formula (2), the matrix on the left side is updated repeatedly for N iterations to finally obtain the original image matrix $H_I$: the first layer takes $H_{T1}$ and $H_{I1}$ as input, and each of the following N-1 layers takes the output of the previous layer as input. The normalization operation smooths the values; if the dimension is large, the computed dot products are large, and dividing by $\sqrt{D}$ scales them down so that the values become smoother.
Similarly, the original text matrix is obtained with formula (3):

$$H_T = \mathrm{softmax}\!\left(\frac{H_{I1} H_{T1}^\top}{\sqrt{D}}\right) H_{T1} \qquad (3)$$
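Formulas (2) and (3) can be sketched as the following iterative cross-modal attention (softmax is assumed to be the normalization operation, and the layer-by-layer reuse of outputs follows the description above):

```python
import torch

def cross_modal_iterate(H_T1: torch.Tensor, H_I1: torch.Tensor, N: int = 6):
    D = H_T1.shape[-1]
    H_T, H_I = H_T1, H_I1          # layer 1 consumes the pre-encoded matrices
    for _ in range(N):
        # formula (2): the text side attends over the image sub-vectors
        A_TI = torch.softmax(H_T @ H_I.transpose(0, 1) / D ** 0.5, dim=-1)
        # formula (3): the image side attends over the text sub-vectors
        A_IT = torch.softmax(H_I @ H_T.transpose(0, 1) / D ** 0.5, dim=-1)
        H_I, H_T = A_TI @ H_I, A_IT @ H_T  # the next N-1 layers reuse these
    return H_T, H_I                # original text matrix, original image matrix

H_T, H_I = cross_modal_iterate(torch.randn(128, 512), torch.randn(96, 512))
```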
Referring to fig. 5, in some embodiments of the present application, step S700 includes, but is not limited to, step S710, step S720, step S730, and step S740, which are described in detail below with reference to fig. 5.
Step S710: constructing a first loss function according to the original abstract data, the first positive example pairs and the corresponding first negative example pairs.
Specifically, in step S710 of some embodiments, when the model is trained for input-output text consistency, the input training set is one batch containing M samples. To ensure that the input original text data and the original abstract data are consistent in subject, a first positive example pair is constructed from an original abstract vector and its corresponding original text vector, while the other original text data and original abstract data in the batch can be regarded as the first negative example pairs of this first positive example pair. For example, a batch may be expressed as $B = \{D_1, D_2, \ldots, D_M\}$, where each $D_i$ in $B$ includes original text data $T_i$ and corresponding original abstract data $Y_i$. Within the same batch, the original text data $T_i$ and corresponding original abstract data $Y_i$ of one sample can be regarded as a first positive example pair, and the original text data and original abstract data of the other samples in the batch form the corresponding first negative example pairs.
The first loss function is constructed from the original abstract data, the first positive example pairs and the corresponding first negative example pairs obtained in the above steps, and is expressed by formula (4):

$$L_1 = -\sum_{i=1}^{M} \log \frac{\exp\left(\mathrm{sim}(\bar{h}_{T_i}, \bar{h}_{Y_i})/\tau\right)}{\sum_{j=1}^{M} \exp\left(\mathrm{sim}(\bar{h}_{T_i}, \bar{h}_{Y_j})/\tau\right)} \qquad (4)$$
In formula (4), $\mathrm{sim}(\cdot,\cdot)$ is a function that computes the cosine similarity between two vectors, $\tau$ is a hyper-parameter used to control the speed of model fitting, $\bar{h}_{T_i}$ represents the original text vector, and $\bar{h}_{Y_i}$ represents the original abstract vector.
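One possible in-batch implementation of the contrastive loss of formula (4) is sketched below (the cross-entropy formulation is equivalent to the negative log-ratio form above); the same function yields the second loss of formula (5) when given image vectors in place of abstract vectors:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_text: torch.Tensor, h_other: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    # h_text, h_other: (M, D); row i of h_other is the positive example of
    # row i of h_text, and all other rows of h_other are its negatives.
    sim = F.cosine_similarity(h_text.unsqueeze(1), h_other.unsqueeze(0), dim=-1)
    labels = torch.arange(h_text.size(0))  # the diagonal holds the positives
    return F.cross_entropy(sim / tau, labels)

loss_1 = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```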
Step S720: constructing a second loss function according to the original abstract data, the second positive example pairs and the corresponding second negative example pairs.
In step S720 of some embodiments, when the model is trained for input text-image consistency, the input training set is one batch containing M samples. To ensure that the input original text data and the original image data are consistent in subject, a second positive example pair is constructed from an original text vector and its corresponding original image vector, while the other original image data and original text data in the batch can be regarded as the second negative example pairs of this second positive example pair. For example, a batch may be expressed as $B = \{D_1, D_2, \ldots, D_M\}$, where each $D_i$ in $B$ includes original text data $T_i$ and corresponding original image data $I_i$. Within the same batch, the original text data $T_i$ and corresponding original image data $I_i$ of one sample can be regarded as a second positive example pair, and the original text data and original image data of the other samples in the batch form the corresponding second negative example pairs.
Similarly to the preceding step, the second loss function is constructed from the original abstract data, the second positive example pairs and the corresponding second negative example pairs obtained in the above steps, and is expressed by formula (5):

$$L_2 = -\sum_{i=1}^{M} \log \frac{\exp\left(\mathrm{sim}(\bar{h}_{T_i}, \bar{h}_{I_i})/\tau\right)}{\sum_{j=1}^{M} \exp\left(\mathrm{sim}(\bar{h}_{T_i}, \bar{h}_{I_j})/\tau\right)} \qquad (5)$$

In formula (5), $\mathrm{sim}(\cdot,\cdot)$ is a function that computes the cosine similarity between two vectors, $\tau$ is a hyper-parameter used to control the speed of model fitting, $\bar{h}_{T_i}$ represents the original text vector, and $\bar{h}_{I_i}$ represents the original image vector.
Step S730: obtaining a target loss function according to the first loss function and the second loss function.
In step S730 of some embodiments, the target loss function is constructed from the first loss function and the second loss function, and is expressed by formula (6):

$$L = L_1 + L_2 \qquad (6)$$
Step S740: fine-tuning the parameters of the original abstract generation model according to the target loss function to obtain the text abstract generation model.
In step S740 of some embodiments, the parameters $\theta$ of the original abstract generation model shown in formula (1) are fine-tuned with the target loss function, thereby obtaining the text abstract generation model.
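The fine-tuning of step S740 can be sketched as follows (the optimizer, learning rate and stand-in trainable projection are illustrative assumptions; in the embodiments the gradients of the target loss flow into the parameters θ of the original abstract generation model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(h_a, h_b, tau=0.07):
    sim = F.cosine_similarity(h_a.unsqueeze(1), h_b.unsqueeze(0), dim=-1)
    return F.cross_entropy(sim / tau, torch.arange(h_a.size(0)))

proj = nn.Linear(512, 512)   # stand-in for the model's trainable parameters
optimizer = torch.optim.AdamW(proj.parameters(), lr=2e-5)
for _ in range(3):           # a few illustrative fine-tuning steps
    h_T = proj(torch.randn(8, 512))  # original text vectors
    h_I = proj(torch.randn(8, 512))  # original image vectors
    h_Y = proj(torch.randn(8, 512))  # original abstract vectors
    loss = contrastive_loss(h_T, h_Y) + contrastive_loss(h_T, h_I)  # formula (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```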
Referring to FIG. 6, a second aspect of some embodiments of the present application further provides a text abstract generation method, which includes, but is not limited to, steps S800 and S900; the two steps are described in detail below with reference to FIG. 6.
Step S800: acquiring to-be-generated text data and to-be-generated image data.
Step S900: inputting the to-be-generated text data and the to-be-generated image data into the text abstract generation model to generate a target text abstract, where the text abstract generation model is trained by the training method of any one of the embodiments of the first aspect.
In this embodiment, the to-be-generated text data and to-be-generated image data for which a multimodal abstract is needed are input into the text abstract generation model trained in the embodiments of the first aspect, so as to obtain the multimodal target text abstract corresponding to the to-be-generated text data and the to-be-generated image data.
According to the text abstract generation method described above, original training data are first acquired. Multimodal encoding is then performed on the original image data in the original training data to obtain an original image vector, and on the original text data to obtain an original text vector. Original abstract data are obtained according to the original text data and the original image data, and the original abstract data are vectorized to obtain an original abstract vector. A first positive example pair is constructed from the original abstract vector and its corresponding original text vector, and a second positive example pair is constructed from the original text vector and its corresponding original image vector. Finally, contrastive learning training is performed on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model. With this arrangement, the obtained text abstract generation model not only has the capability of generating text abstracts, but also has an enhanced capability of semantically representing the multimodal text and image data. Because the original abstract generation model undergoes contrastive learning training on the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs, the text abstract generation model can fully consider the relation between the text and the images and the relation between the text and the target abstract when generating the target abstract, which improves the accuracy of the text abstract generated by the text abstract generation model.
Referring to FIG. 7, some embodiments of the present application further provide a training device for training a text abstract generation model. The training device includes a first acquisition module 1000, an encoding module 1100, a first processing module 1200, a second processing module 1300, a first construction module 1400, a second construction module 1500 and a training module 1600.
The first acquisition module 1000 is configured to acquire at least two items of original training data, where the original training data include original image data and original text data, and the original image data and the original text data are in one-to-one correspondence.
The encoding module 1100 is configured to perform multimodal encoding on the original image data to obtain an original image vector, and perform multimodal encoding on the original text data to obtain an original text vector.
The first processing module 1200 is configured to obtain original abstract data according to the original text data and the original image data.
The second processing module 1300 is configured to vectorize the original abstract data to obtain an original abstract vector.
The first construction module 1400 is configured to construct a first positive example pair according to the original abstract vector and the original text vector.
The second construction module 1500 is configured to construct a second positive example pair according to the original text vector and the original image vector.
The training module 1600 is configured to perform contrastive learning training on an original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model.
According to the model training device described above, original training data are first acquired. Multimodal encoding is then performed on the original image data in the original training data to obtain an original image vector, and on the original text data to obtain an original text vector. Original abstract data are obtained according to the original text data and the original image data, and the original abstract data are vectorized to obtain an original abstract vector. A first positive example pair is constructed from the original abstract vector and its corresponding original text vector, and a second positive example pair is constructed from the original text vector and its corresponding original image vector. Finally, contrastive learning training is performed on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model. With this arrangement, the obtained text abstract generation model not only has the capability of generating text abstracts, but also has an enhanced capability of semantically representing the multimodal text and image data. Because the original abstract generation model undergoes contrastive learning training on the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs, the text abstract generation model can fully consider the relation between the text and the images and the relation between the text and the target abstract when generating the target abstract, which improves the accuracy of the text abstract generated by the text abstract generation model.
It should be noted that the model training device in the embodiments of the present application corresponds to the foregoing model training method; for the specific training or processing steps, refer to the foregoing model training method, which is not described in detail here.
In a fourth aspect, some embodiments of the present application further provide a text abstract generation device, which includes a second acquisition module and an abstract generation module.
The second acquisition module is configured to acquire to-be-generated text data and to-be-generated image data.
The abstract generation module is configured to input the to-be-generated text data and the to-be-generated image data into the text abstract generation model to generate a target text abstract; the text abstract generation model is trained by the training method of any one of the embodiments of the first aspect.
According to the text abstract generation device described above, original training data are first acquired. Multimodal encoding is then performed on the original image data in the original training data to obtain an original image vector, and on the original text data to obtain an original text vector. Original abstract data are obtained according to the original text data and the original image data, and the original abstract data are vectorized to obtain an original abstract vector. A first positive example pair is constructed from the original abstract vector and its corresponding original text vector, and a second positive example pair is constructed from the original text vector and its corresponding original image vector. Finally, contrastive learning training is performed on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model. With this arrangement, the obtained text abstract generation model not only has the capability of generating text abstracts, but also has an enhanced capability of semantically representing the multimodal text and image data. Because the original abstract generation model undergoes contrastive learning training on the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs, the text abstract generation model can fully consider the relation between the text and the images and the relation between the text and the target abstract when generating the target abstract, which improves the accuracy of the text abstract generated by the text abstract generation model.
It should be noted that the text abstract generation device in the embodiments of the present application corresponds to the foregoing text abstract generation method; for the specific operation steps or flow, refer to the foregoing text abstract generation method, which is not described in detail here.
The embodiment of the disclosure also provides an electronic device, including:
At least one memory;
at least one processor;
At least one program;
The program is stored in the memory, and the processor executes the at least one program to implement the above model training method or text abstract generation method of the present disclosure. The electronic device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a vehicle-mounted computer and the like.
According to the electronic device described above, by executing the model training method or the text abstract generation method, original training data are first acquired. Multimodal encoding is then performed on the original image data in the original training data to obtain an original image vector, and on the original text data to obtain an original text vector. Original abstract data are obtained according to the original text data and the original image data, and the original abstract data are vectorized to obtain an original abstract vector. A first positive example pair is constructed from the original abstract vector and its corresponding original text vector, and a second positive example pair is constructed from the original text vector and its corresponding original image vector. Finally, contrastive learning training is performed on the original abstract generation model with the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs to obtain the text abstract generation model. With this arrangement, the obtained text abstract generation model not only has the capability of generating text abstracts, but also has an enhanced capability of semantically representing the multimodal text and image data. Because the original abstract generation model undergoes contrastive learning training on the original abstract data, the plurality of first positive example pairs and the plurality of second positive example pairs, the text abstract generation model can fully consider the relation between the text and the images and the relation between the text and the target abstract when generating the target abstract, which improves the accuracy of the text abstract generated by the text abstract generation model.
An electronic device according to an embodiment of the present application is described in detail below with reference to fig. 8.
As shown in FIG. 8, the electronic device of another embodiment includes:
The processor 1700 may be implemented by a general purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an Application SPECIFIC INTEGRATED Circuit (ASIC), or one or more integrated circuits, etc. for executing related programs to implement the technical solutions provided by the embodiments of the present disclosure;
The Memory 1800 may be implemented in the form of Read Only Memory (ROM), static storage, dynamic storage, or random access Memory (Random Access Memory, RAM). Memory 1800 may store an operating system and other application programs, and when implementing the technical solutions provided by the embodiments of the present disclosure by software or firmware, relevant program code is stored in memory 1800 and invoked by processor 1700 to execute the training method or the text digest generation method of the model of the embodiments of the present disclosure;
An input/output interface 1900 for inputting and outputting information;
The communication interface 2000, which is configured to implement communication interaction between this device and other devices, either in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth);
Bus 2100 transmits information between the various components of the device (e.g., processor 1700, memory 1800, input/output interface 1900, and communication interface 2000);
wherein the processor 1700, the memory 1800, the input/output interface 1900, and the communication interface 2000 establish communication connections with one another inside the device via the bus 2100.
The embodiments of the present disclosure also provide a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the training method of the above model or the text abstract generation method.
By causing a computer to execute the training method of the model or the text abstract generation method, the storage medium realizes the same flow: original training data are acquired; multi-modal encoding processing is performed on the original image data therein to obtain original image vectors, and on the original text data therein to obtain original text vectors; original abstract data are obtained according to the original text data and the original image data and are vectorized to obtain original abstract vectors; a first positive example pair is constructed according to each original abstract vector and its corresponding original text vector, and a second positive example pair according to each original text vector and its corresponding original image vector; finally, contrastive learning training is performed on the original abstract generation model through the original abstract data, the plurality of first positive example pairs, and the plurality of second positive example pairs to obtain the text abstract generation model. With this arrangement, the resulting text abstract generation model has the ability to generate text abstracts and a strengthened semantic representation of multi-modal text-and-image data, and the contrastive learning training enables the model to fully consider the relation between the text and the image, and between the text and the target abstract, when generating the target abstract, improving the accuracy of the generated text abstract.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the present disclosure are intended to describe its technical solutions more clearly and do not limit them; as those skilled in the art will appreciate, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not limit the embodiments of the present disclosure; implementations may include more or fewer steps than shown, combine certain steps, or divide the steps differently.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b or c" may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of units is merely a logical functional division, and in actual implementation there may be other ways of division, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Moreover, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions for causing an electronic device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing a program.
The preferred embodiments of the present disclosure are described above with reference to the accompanying drawings; they do not limit the scope of the claims. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present disclosure shall fall within the scope of the claims.

Claims (10)

1. A method of training a model, the method being for training a text abstract generation model, the method comprising:
acquiring at least two pieces of original training data; the original training data comprise original image data and original text data, and the original image data correspond to the original text data one to one;
performing multi-modal encoding processing on the original image data to obtain an original image vector, and performing multi-modal encoding processing on the original text data to obtain an original text vector;
obtaining original abstract data according to the original text data and the original image data;
vectorizing the original abstract data to obtain an original abstract vector;
constructing a first positive example pair according to the original abstract vector and the original text vector;
constructing a second positive example pair according to the original text vector and the original image vector;
and performing contrastive learning training on the original abstract generation model through the original abstract data, a plurality of the first positive example pairs, and a plurality of the second positive example pairs to obtain the text abstract generation model.
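Purely by way of illustration (not part of the claim), the pair construction recited above can be pictured as batch-wise index matching; all names, values, and dimensions below are assumptions:

    import torch

    batch_size, dim = 4, 8
    abs_vec = torch.randn(batch_size, dim)   # original abstract vectors
    txt_vec = torch.randn(batch_size, dim)   # original text vectors
    img_vec = torch.randn(batch_size, dim)   # original image vectors

    # First positive pair: an abstract vector with its corresponding text vector;
    # second positive pair: a text vector with its corresponding image vector.
    first_pairs  = [(abs_vec[i], txt_vec[i]) for i in range(batch_size)]
    second_pairs = [(txt_vec[i], img_vec[i]) for i in range(batch_size)]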
2. The training method according to claim 1, wherein the performing multi-modal encoding processing on the original image data to obtain an original image vector, and performing multi-modal encoding processing on the original text data to obtain an original text vector, comprises:
performing cross-modal encoding on the original text data according to a preset cross-modal encoder to obtain an original text matrix;
carrying out pooling mapping processing on the original text matrix to obtain the original text vector;
performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix;
and carrying out pooling mapping processing on the original image matrix to obtain the original image vector.
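As a non-limiting sketch of the pooling mapping recited above: mean pooling followed by a linear map is one plausible reading, and the cross-modal encoder producing the input matrix is assumed rather than implemented here:

    import torch
    import torch.nn as nn

    class PoolMap(nn.Module):
        # Mean-pool a token/patch matrix into one vector, then apply a linear map.
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, matrix):                   # matrix: (B, L, d)
            return self.proj(matrix.mean(dim=1))     # vector: (B, d)

    pool = PoolMap(512)
    original_text_matrix = torch.randn(2, 20, 512)      # output of an assumed cross-modal encoder
    original_text_vector = pool(original_text_matrix)   # shape (2, 512)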
3. The training method of claim 2, wherein the obtaining original abstract data according to the original text data and the original image data comprises:
splicing the original text matrix and the original image matrix to obtain a target abstract matrix;
and decoding the target abstract matrix according to a preset decoder to obtain the original abstract data.
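Illustratively (the shapes and the decoder interface are assumptions), the splicing step can be read as concatenating the two matrices along the sequence axis before decoding:

    import torch

    text_matrix  = torch.randn(2, 20, 512)   # original text matrix  (B, L_text, d)
    image_matrix = torch.randn(2, 49, 512)   # original image matrix (B, L_image, d)

    # Splice along the sequence axis to form the target abstract matrix.
    target_abstract_matrix = torch.cat([text_matrix, image_matrix], dim=1)  # (B, L_text + L_image, d)

    # original_abstract = decoder.generate(encoder_hidden_states=target_abstract_matrix)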
4. The training method according to claim 2, wherein the performing cross-modal encoding on the original image data according to the cross-modal encoder and the original text data to obtain an original image matrix comprises:
precoding the original text data to obtain a text sub-vector matrix;
precoding the original image data to obtain an image sub-vector matrix;
acquiring a transpose matrix of the image sub-vector matrix to obtain the image sub-vector transpose matrix;
and carrying out iterative processing according to the text sub-vector matrix, the image sub-vector transpose matrix and the image sub-vector matrix to obtain the original image matrix.
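The iterative processing over the text sub-vector matrix, the image sub-vector transpose matrix, and the image sub-vector matrix reads like standard scaled dot-product cross-attention; the sketch below assumes that reading, and every name and shape is invented for illustration:

    import math
    import torch
    import torch.nn.functional as F

    def cross_modal_step(text_sub, image_sub):
        d = text_sub.size(-1)
        # Text sub-vectors attend over image sub-vectors via the transpose matrix ...
        scores = text_sub @ image_sub.transpose(-2, -1) / math.sqrt(d)
        # ... and the attention weights re-combine the image sub-vector matrix.
        return F.softmax(scores, dim=-1) @ image_sub

    text_sub  = torch.randn(2, 20, 512)   # precoded text  (text sub-vector matrix)
    image_sub = torch.randn(2, 49, 512)   # precoded image (image sub-vector matrix)
    original_image_matrix = cross_modal_step(text_sub, image_sub)   # shape (2, 20, 512)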
5. The training method according to any one of claims 1 to 4, wherein the performing contrastive learning training on the original abstract generation model through the original abstract data, the plurality of first positive example pairs, and the plurality of second positive example pairs to obtain the text abstract generation model comprises:
constructing a first loss function according to the original abstract data, the first positive example pair, and the corresponding first negative example pair;
constructing a second loss function according to the original abstract data, the second positive example pair, and the corresponding second negative example pair;
obtaining a target loss function according to the first loss function and the second loss function;
and performing parameter fine-tuning on the original abstract generation model according to the target loss function to obtain the text abstract generation model.
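Purely as an illustration of one plausible instantiation, not a limitation of the claim: the two loss functions may take the InfoNCE form, with the non-matching rows of the batch serving as the negative example pairs and the target loss function taken as their sum (equal weighting is an assumption):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(anchor, positive, temperature=0.07):
        # Row i of `positive` is the positive for row i of `anchor`;
        # every other row in the batch serves as a negative pair.
        logits = anchor @ positive.t() / temperature
        labels = torch.arange(anchor.size(0), device=anchor.device)
        return F.cross_entropy(logits, labels)

    B, d = 4, 256
    abs_vec, txt_vec, img_vec = (torch.randn(B, d) for _ in range(3))

    first_loss  = contrastive_loss(abs_vec, txt_vec)   # first loss function
    second_loss = contrastive_loss(txt_vec, img_vec)   # second loss function
    target_loss = first_loss + second_loss             # target loss function
    # target_loss.backward()  # parameter fine-tuning of the original abstract generation model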
6. A method for generating a text abstract, the method comprising:
acquiring text data to be generated and image data to be generated;
inputting the text data to be generated and the image data to be generated into a text abstract generation model to generate a target text abstract; the text abstract generation model is trained according to the training method of any one of claims 1 to 5.
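A minimal usage sketch, in which the model object, its interface, and the preprocessing are all assumptions:

    import torch

    text_to_generate  = "Article body paired with a product photo ..."
    image_to_generate = torch.randn(1, 3, 224, 224)   # an already-preprocessed image tensor

    # Hypothetical inference call on a trained text abstract generation model:
    # target_text_abstract = text_abstract_model.generate(text_to_generate, image_to_generate)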
7. A training device for training a text abstract generation model, the training device comprising:
a first acquisition module, configured to acquire at least two pieces of original training data; the original training data comprise original image data and original text data, and the original image data correspond to the original text data one to one;
an encoding module, configured to perform multi-modal encoding processing on the original image data to obtain an original image vector, and to perform multi-modal encoding processing on the original text data to obtain an original text vector;
a first processing module, configured to obtain original abstract data according to the original text data and the original image data;
a second processing module, configured to perform vectorization processing on the original abstract data to obtain an original abstract vector;
a first construction module, configured to construct a first positive example pair according to the original abstract vector and the original text vector;
a second construction module, configured to construct a second positive example pair according to the original text vector and the original image vector;
and a training module, configured to perform contrastive learning training on the original abstract generation model through the original abstract data, a plurality of the first positive example pairs, and a plurality of the second positive example pairs to obtain the text abstract generation model.
8. A text abstract generation apparatus, characterized in that the text abstract generation apparatus comprises:
a second acquisition module, configured to acquire text data to be generated and image data to be generated;
and an abstract generation module, configured to input the text data to be generated and the image data to be generated into a text abstract generation model to generate a target text abstract; the text abstract generation model is trained according to the training method of any one of claims 1 to 5.
9. An electronic device, comprising:
at least one memory;
at least one processor;
at least one program;
wherein the program is stored in the memory, and the processor executes the at least one program to implement:
the training method of any one of claims 1 to 5; or
the text abstract generation method of claim 6.
10. A storage medium that is a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions for causing a computer to perform:
the training method of any one of claims 1 to 5; or
the text abstract generation method of claim 6.
CN202210160816.5A 2022-02-22 2022-02-22 Model training method and device, text abstract generating method and device and equipment Active CN114519395B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210160816.5A CN114519395B (en) 2022-02-22 2022-02-22 Model training method and device, text abstract generating method and device and equipment
PCT/CN2022/090729 WO2023159763A1 (en) 2022-02-22 2022-04-29 Model training method and apparatus, text summary generating method and apparatus, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210160816.5A CN114519395B (en) 2022-02-22 2022-02-22 Model training method and device, text abstract generating method and device and equipment

Publications (2)

Publication Number Publication Date
CN114519395A CN114519395A (en) 2022-05-20
CN114519395B true CN114519395B (en) 2024-05-14

Family

ID=81598766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210160816.5A Active CN114519395B (en) 2022-02-22 2022-02-22 Model training method and device, text abstract generating method and device and equipment

Country Status (2)

Country Link
CN (1) CN114519395B (en)
WO (1) WO2023159763A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033683B (en) * 2022-06-17 2024-05-07 平安科技(深圳)有限公司 Digest generation method, digest generation device, digest generation equipment and storage medium
CN115410212B (en) * 2022-11-02 2023-02-07 平安科技(深圳)有限公司 Multi-modal model training method and device, computer equipment and storage medium
CN115934933B (en) * 2023-03-09 2023-07-04 合肥工业大学 Text abstract generation method and system based on double-end contrast learning
CN117094367B (en) * 2023-10-19 2024-03-29 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN118133241A (en) * 2024-05-07 2024-06-04 中国科学院自动化研究所 Training method, device, equipment and storage medium of multi-mode pre-training model

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1525355A (zh) * 2003-02-21 2004-09-01 Image processing method and image processing system
US9129216B1 (en) * 2013-07-15 2015-09-08 Xdroid Kft. System, method and apparatus for computer aided association of relevant images with text
EP3754548A1 (en) * 2019-06-17 2020-12-23 Sap Se A method for recognizing an object in an image using features vectors of an encoding neural network
CN111428025A (en) * 2020-06-10 2020-07-17 科大讯飞(苏州)科技有限公司 Text summarization method and device, electronic equipment and storage medium
CN112464993A (en) * 2020-11-05 2021-03-09 苏州浪潮智能科技有限公司 Multi-mode model training method, device, equipment and storage medium
CN113052149A (en) * 2021-05-20 2021-06-29 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113204670A (en) * 2021-05-24 2021-08-03 合肥工业大学 Attention model-based video abstract description generation method and device
CN113408208A (en) * 2021-06-25 2021-09-17 成都欧珀通信科技有限公司 Model training method, information extraction method, related device and storage medium
CN113987169A (en) * 2021-10-14 2022-01-28 润联软件***(深圳)有限公司 Text abstract generation method, device and equipment based on semantic block and storage medium
CN113935315A (en) * 2021-10-26 2022-01-14 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium
CN114005012A (en) * 2021-11-05 2022-02-01 北京市商汤科技开发有限公司 Training method, device, equipment and storage medium of multi-mode pre-training model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference; anonymous authors; https://arxiv.org/pdf/arXiv:2108.05123v1; 2021-08-11; Sections 1-4 *

Also Published As

Publication number Publication date
WO2023159763A1 (en) 2023-08-31
CN114519395A (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN114519395B (en) Model training method and device, text abstract generating method and device and equipment
CN112182166A (en) Text matching method and device, electronic equipment and storage medium
WO2023108993A1 (en) Product recommendation method, apparatus and device based on deep clustering algorithm, and medium
CN114897060B (en) Training method and device for sample classification model, and sample classification method and device
CN114841146B (en) Text abstract generation method and device, electronic equipment and storage medium
CN113705315A (en) Video processing method, device, equipment and storage medium
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
CN116775875A (en) Question corpus construction method and device, question answering method and device and storage medium
CN116701604A (en) Question and answer corpus construction method and device, question and answer method, equipment and medium
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium
CN116702743A (en) Text similarity detection method and device, electronic equipment and storage medium
CN116628207A (en) Training method and device for text classification model, electronic equipment and storage medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN116432705A (en) Text generation model construction method, text generation device, equipment and medium
CN116341553A (en) Named entity recognition method and device, electronic equipment and storage medium
CN115795007A (en) Intelligent question-answering method, intelligent question-answering device, electronic equipment and storage medium
CN114998041A (en) Method and device for training claim settlement prediction model, electronic equipment and storage medium
CN115270900A (en) User intention identification method and device, electronic equipment and storage medium
CN113657092A (en) Method, apparatus, device and medium for identifying label
CN115130432B (en) Text processing method, text processing device, electronic equipment and storage medium
CN115145980B (en) Dialogue reply generation method and device, electronic equipment and storage medium
CN114359810B (en) Video abstract generation method and device, electronic equipment and storage medium
CN113743050B (en) Article layout evaluation method, apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant