CN114842488A - Image title text determination method and device, electronic equipment and storage medium - Google Patents

Image title text determination method and device, electronic equipment and storage medium

Info

Publication number
CN114842488A
Authority
CN
China
Prior art keywords
image
text
title
prediction
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210467945.9A
Other languages
Chinese (zh)
Inventor
刘鎏
周鑫
左凯
曹佐
张弓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202210467945.9A
Publication of CN114842488A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the disclosure provides an image title text determination method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring text features and image visual features corresponding to a target image; inputting the text features and the image visual features into a target title text extraction model; and processing the text features and the image visual features based on the target title text extraction model, and determining a target image title text corresponding to the target image. By fully exploiting cross-modal information interaction between image features of different granularities and text features, the embodiment of the disclosure determines whether an image carries a user-edited title and accurately extracts and orders that title, improving the quality and relevance of image selection to a certain extent.

Description

Image title text determination method and device, electronic equipment and storage medium
Technical Field
The embodiments of the disclosure relate to the technical field of image processing, and in particular to an image title text determination method and device, an electronic device and a storage medium.
Background
With the rapid development of the internet and the wide adoption of social media in recent years, people have grown accustomed to uploading what they see anytime and anywhere, and the proportion of multimodal data in such services keeps increasing. In a review search service, the current search interface displays one picture for each review or note a user uploads, yet most users actually attach more than one picture. How to select the single picture displayed in the search interface, so as to optimize the user's search experience, has therefore become a major direction of effort for the business.
To select high-quality, valuable pictures, display-picture selection currently relies mainly on the following two schemes:
1. Selection based on picture quality. Most current picture selection scenarios evaluate the quality of each picture, score it, and display the pictures ranked by score;
2. Selection based on text understanding. Current image understanding methods rely on image-text multimodal fusion: the similarity between the text and each image is scored, and the image with the highest similarity is selected for display.
These two schemes select the display picture according to visual aesthetic quality and image-text correlation, respectively, and improve the user's search experience through better picture quality and information content. However, research has found that content whose display picture carries a user-edited title outperforms content without one in click-through rate and positive feedback (e.g., likes and favorites). How to accurately extract the user-edited title from an image is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiments of the disclosure provide an image title text determination method, an image title text determination device, electronic equipment and a storage medium, which fully exploit cross-modal information interaction between image features of different granularities and text features to determine whether an image carries a user-edited title and to accurately extract and order that title, improving the quality and relevance of image selection to a certain extent.
According to a first aspect of embodiments of the present disclosure, there is provided an image title text determination method, including:
acquiring text features and image visual features corresponding to a target image;
inputting the text features and the image visual features into a target title text extraction model;
and processing the text features and the image visual features based on the target title text extraction model, and determining a target image title text corresponding to the target image.
Optionally, the obtaining of the text feature and the image visual feature corresponding to the target image includes:
identifying image text in the target image based on a character recognition technology;
based on the region of the image text in the target image, cutting the target image to generate a text region image and a non-text region image;
and respectively carrying out feature extraction processing on the text region image and the non-text region image based on a pre-training feature extraction model to obtain the text features and the image visual features.
Optionally, before inputting the text features and the image visual features into the target title text extraction model, the method further includes:
acquiring a sample image; the sample image is an image containing text, and the sample image corresponds to an initial title text label;
acquiring sample text characteristics and sample image visual characteristics corresponding to the sample image;
inputting the sample text features and the sample image visual features into a title text extraction model to be trained; the title text extraction model to be trained comprises: an encoding layer and a prediction layer;
calling the coding layer to perform fusion processing on the sample text features and the sample image visual features to generate image fusion features;
calling the prediction layer to perform prediction processing on the image fusion characteristics to obtain a prediction title text label;
calculating to obtain a loss value corresponding to the to-be-trained title text extraction model based on the initial title text label and the predicted title text label;
and under the condition that the loss value is within a preset range, taking the trained to-be-trained title text extraction model as the target title text extraction model.
Optionally, the prediction layer comprises: a first title prediction layer, a second title prediction layer and a text order prediction layer,
the calling the prediction layer to perform prediction processing on the image fusion features to obtain a predicted title text label comprises:
calling the first title prediction layer to perform prediction processing on the image fusion features, and generating a first prediction label for predicting whether the image contains an edited title text;
calling the second title prediction layer to perform prediction processing on the image fusion features, and generating a second prediction label for predicting whether a text constitutes a user-edited title;
calling the text order prediction layer to perform prediction processing on the image fusion features, and generating a third prediction label for predicting the order of the text within the user-edited title text;
and determining the predicted title text label according to the first predicted label, the second predicted label and the third predicted label.
Optionally, the target title text extraction model includes: an encoding layer and a prediction layer,
the processing the text features and the image visual features based on the target title text extraction model to determine the target image title text corresponding to the target image comprises:
calling the coding layer to perform fusion processing on the text features and the image visual features to generate target image fusion features;
and calling the prediction layer to perform title prediction processing on the target image fusion characteristics to obtain a target image title text corresponding to the target image.
According to a second aspect of the embodiments of the present disclosure, there is provided an image title text determination apparatus including:
the image characteristic acquisition module is used for acquiring text characteristics and image visual characteristics corresponding to the target image;
the image characteristic input module is used for inputting the text characteristic and the image visual characteristic into a target title text extraction model;
and the image title determining module is used for processing the text features and the image visual features based on the target title text extraction model and determining a target image title text corresponding to the target image.
Optionally, the image feature obtaining module includes:
an image text recognition unit for recognizing an image text in the target image based on a character recognition technique;
a region image generating unit, configured to crop the target image based on a region of the image text within the target image, and generate a text region image and a non-text region image;
and the image characteristic acquisition unit is used for respectively carrying out characteristic extraction processing on the text region image and the non-text region image based on a pre-training characteristic extraction model to obtain the text characteristic and the image visual characteristic.
Optionally, the apparatus further comprises:
the sample image acquisition module is used for acquiring a sample image; the sample image is an image containing text, and the sample image corresponds to an initial title text label;
the sample image characteristic acquisition module is used for acquiring sample text characteristics and sample image visual characteristics corresponding to the sample image;
the sample image characteristic input module is used for inputting the sample text characteristics and the sample image visual characteristics to a title text extraction model to be trained; the title text extraction model to be trained comprises: an encoding layer and a prediction layer;
the image fusion characteristic generation module is used for calling the coding layer to perform fusion processing on the sample text characteristic and the sample image visual characteristic to generate an image fusion characteristic;
the prediction title label acquisition module is used for calling the prediction layer to perform prediction processing on the image fusion characteristics to obtain a prediction title text label;
the loss value calculation module is used for calculating and obtaining a loss value corresponding to the to-be-trained title text extraction model based on the initial title text label and the predicted title text label;
and the target title extraction model acquisition module is used for taking the trained to-be-trained title text extraction model as the target title text extraction model under the condition that the loss value is within a preset range.
Optionally, the prediction layer comprises: a first title prediction layer, a second title prediction layer and a text order prediction layer,
the predicted title tag obtaining module includes:
a first prediction tag generation unit, configured to invoke the first title prediction layer to perform prediction processing on the image fusion feature, and generate a first prediction tag for predicting whether the image contains an edited title text;
the second prediction label generating unit is used for calling the second title prediction layer to perform prediction processing on the image fusion features and generating a second prediction label for predicting whether a text constitutes a user-edited title;
a third prediction tag generation unit, configured to invoke the text order prediction layer to perform prediction processing on the image fusion feature, and generate a third prediction tag for predicting the order of a text within the user-edited title text;
a predicted title tag determination unit configured to determine the predicted title text tag according to the first predicted tag, the second predicted tag, and the third predicted tag.
Optionally, the target title text extraction model includes: an encoding layer and a prediction layer,
the image title determination module includes:
the target fusion feature generation unit is used for calling the coding layer to perform fusion processing on the text features and the image visual features to generate target image fusion features;
and the target image title acquisition unit is used for calling the prediction layer to perform title prediction processing on the target image fusion characteristics to obtain a target image title text corresponding to the target image.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor implements any one of the image title text determination methods described above when executing the program.
According to a fourth aspect of embodiments of the present disclosure, there is provided a readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform any one of the image title text determination methods described above.
The embodiment of the disclosure provides an image title text determination method, an image title text determination device, electronic equipment and a storage medium. By fully exploiting cross-modal information interaction between image features of different granularities and text features, the embodiment of the disclosure determines whether an image carries a user-edited title and accurately extracts and orders that title, improving the quality and relevance of image selection to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments of the present disclosure will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a flowchart illustrating steps of a method for determining a title text of an image according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating steps of another method for determining a title text of an image according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an image title text determination apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of another image title text determination apparatus according to an embodiment of the present disclosure.
Detailed Description
Technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present disclosure, belong to the protection scope of the embodiments of the present disclosure.
Example one
Referring to fig. 1, a flowchart illustrating steps of an image title text determination method provided by an embodiment of the present disclosure is shown, and as shown in fig. 1, the image title text determination method may include the following steps:
step 101: and acquiring text features and image visual features corresponding to the target image.
The method of this embodiment can be applied to scenarios in which the user-edited title text in an image is extracted through cross-modal information interaction between image features of different granularities and text features.
The target image is an image from which the user-edited title text is to be extracted; in this example, the target image may be a dish image, an item image, or the like.
The text features refer to features formed by the text within the target image. In this example, the text features may include the image text recognized by a character recognition technology, together with features such as the text box and the position corresponding to the image text.
The image visual features refer to features formed by the regions of the target image other than the image text.
When extracting an image text edited by a user in a target image, a text feature and an image visual feature corresponding to the target image may be obtained, and a specific obtaining manner will be described in detail in the following second embodiment, which is not described herein again.
After the text feature and the image visual feature corresponding to the target image are acquired, step 102 is executed.
Step 102: and inputting the text features and the image visual features into a target title text extraction model.
The target title text extraction model refers to a model obtained by pre-training and used for extracting text from an image; the training process of the target title text extraction model will be described in detail in the following second embodiment and is not repeated here.
After the text features and the image visual features corresponding to the target image are acquired, the acquired text features and the image visual features can be input into the target title text extraction model.
After inputting the text features and image visual features corresponding to the target image into the target title text extraction model, step 103 is executed.
Step 103: and processing the text features and the image visual features based on the target title text extraction model, and determining a target image title text corresponding to the target image.
After the text features and the image visual features corresponding to the target image are input to the target title text extraction model, the text features and the image visual features may be processed based on the target title text extraction model to determine the target image title text corresponding to the target image.
By fully utilizing cross-modal information interaction between image features of different granularities and text features, this embodiment determines whether an image carries a user-edited title and accurately extracts that title; high-quality images can then be selected according to the extracted title text in subsequent processing, improving the quality and relevance of image selection to a certain extent.
According to the image title text determination method provided by the embodiment of the disclosure, the text features and the image visual features corresponding to the target image are acquired, the text features and the image visual features are input into the target title text extraction model, the text features and the image visual features are processed based on the target title text extraction model, and the target image title text corresponding to the target image is determined. By fully exploiting cross-modal information interaction between image features of different granularities and text features, the embodiment determines whether an image carries a user-edited title and accurately extracts and orders that title, improving the quality and relevance of image selection to a certain extent.
Example two
Referring to fig. 2, a flowchart illustrating steps of another method for determining an image title text provided by an embodiment of the present disclosure is shown, and as shown in fig. 2, the method for determining an image title text may include the following steps:
step 201: acquiring a sample image; the sample image is an image containing text, and the sample image corresponds to an initial title text label.
The method of this embodiment can be applied to scenarios in which the user-edited title text in an image is extracted through cross-modal information interaction between image features of different granularities and text features.
The sample images refer to images used for training a text extraction model, and in this example, the sample images refer to images containing texts, and each sample image corresponds to one initial title text label.
In training the title text extraction model, a sample image containing text may be acquired.
After the sample image is acquired, step 202 is performed.
Step 202: and acquiring a sample text characteristic and a sample image visual characteristic corresponding to the sample image.
Sample text features refer to features formed by text within a sample image.
The visual features of the sample image refer to features formed by images of other areas in the sample image except text.
After the sample image is acquired, the sample text features and the sample image visual features corresponding to the sample image may be acquired.
In this embodiment, the sample text features and the sample image visual features may be used as input features of the to-be-trained title text extraction model. Each input is formed from five embedded features, namely: the image feature, the text feature, the bounding-box position information feature, the segment feature and the position feature, where:
image characteristics: firstly, mapping the image features to be in the same space with the text features to obtain corresponding image features.
Text characteristics: and averaging all the OCR text features to obtain the total text feature of the graph, wherein the text feature is formed by the total text feature and the OCR text feature.
Bounding box location information features: and performing information sorting on each OCR identified boundary box to obtain five-dimensional vectors which respectively represent the abscissa of the central point, the ordinate of the central point, the length, the width and the area of the boundary box, wherein all values are between 0 and 1 and are the proportion of the boundary box relative to the whole graph. At the same time, increaseAnd information of the boundary box of the whole graph. Finally, a bounding box position information encoder E is applied t All bounding box information features are mapped to the same dimension as other features.
Segment characteristics: as with the VQA task in BERT, the entire graph and the different OCR regions are defined as different types of periods, thereby separating the entire graph and the different OCR regions.
Position characteristics: sequence position embedding is added to each input element as in BERT.
In the five embedded features, the text feature, the bounding box position information feature, the segment feature and the position feature form a sample text feature.
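As flagged above, the following is a minimal sketch of the bounding-box position information feature and the encoder E_t; the 768-dimensional shared hidden size, the (x0, y0, x1, y1) box convention, and taking "length" and "width" as the box's horizontal and vertical extents are assumptions not fixed by this disclosure:

```python
import torch
import torch.nn as nn

def bbox_position_feature(box, img_w, img_h):
    """Five-dimensional bounding-box vector: center-point abscissa,
    center-point ordinate, length (horizontal extent), width (vertical
    extent), and area, all between 0 and 1 as proportions of the image."""
    x0, y0, x1, y1 = box
    w = (x1 - x0) / img_w
    h = (y1 - y0) / img_h
    cx = (x0 + x1) / 2 / img_w
    cy = (y0 + y1) / 2 / img_h
    return torch.tensor([cx, cy, w, h, w * h])

# Bounding-box position information encoder E_t: maps the 5-d vector to
# the dimension shared by the other embedded features (768 is assumed).
bbox_encoder = nn.Linear(5, 768)

# The extra whole-image bounding box covers the entire picture.
whole_image_box = bbox_position_feature((0, 0, 640, 480), 640, 480)
ocr_box = bbox_position_feature((40, 20, 360, 80), 640, 480)
print(bbox_encoder(ocr_box).shape)  # torch.Size([768])
```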
After the sample text features and the sample image visual features corresponding to the sample image are obtained, step 203 is executed.
Step 203: inputting the sample text features and the sample image visual features into a title text extraction model to be trained; the title text extraction model to be trained comprises: an encoding layer and a prediction layer.
After the sample text features and the sample image visual features corresponding to the sample images are obtained, the sample text features and the sample image visual features may be input to the to-be-trained title text extraction model, which may include: an encoding layer and a prediction layer.
After the sample text features and the sample image visual features are input to the title text extraction model to be trained, step 204 is performed.
Step 204: and calling the coding layer to perform fusion processing on the sample text characteristics and the sample image visual characteristics to generate image fusion characteristics.
After the sample text features and the sample image visual features are input into the title text extraction model to be trained, the coding layer can be called to fuse the sample text features and the sample image visual features so as to generate image fusion features.
In this example, the coding layer is formed by two Transformer encoder layers. After the sample text features and the sample image visual features are input into the to-be-trained title text extraction model, the two Transformer encoder layers can be called to fuse the input features. The input information undergoes exchange learning through the attention mechanism in the Transformer, so that information can be transmitted and exchanged among picture features, text features, coarse-grained features and fine-grained features.
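A minimal sketch of such a coding layer built from two Transformer encoder layers; the hidden size, head count, and the token layout that concatenates the whole-image feature with the per-OCR-region features are assumptions:

```python
import torch
import torch.nn as nn

hidden = 768  # assumed shared embedding dimension

layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
coding_layer = nn.TransformerEncoder(layer, num_layers=2)

# One token for the whole-image (coarse-grained) feature plus one token
# per OCR region (fine-grained); self-attention lets picture features,
# text features, coarse-grained and fine-grained features exchange
# information.
image_visual_feature = torch.randn(1, 1, hidden)   # whole image
ocr_region_features = torch.randn(1, 12, hidden)   # 12 OCR regions
fused = coding_layer(
    torch.cat([image_visual_feature, ocr_region_features], dim=1))
print(fused.shape)  # torch.Size([1, 13, 768])
```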
After the encoding layer is called to perform the fusion processing on the sample text features and the sample image visual features to generate image fusion features, step 205 is performed.
Step 205: and calling the prediction layer to perform prediction processing on the image fusion characteristics to obtain a prediction title text label.
And after the coding layer is called to perform fusion processing on the sample text features and the sample image visual features to generate image fusion features, the prediction layer can be called to perform prediction processing on the image fusion features to obtain a predicted title text label.
In this example, the prediction layer may be split into three subtasks, each handled by a different prediction network: predicting whether the image contains a user-edited title, predicting whether each OCR text constitutes part of the user-edited title, and predicting the order of each OCR text within the user-edited title. The first two are binary classification tasks, and the third is a multi-class classification task. All prediction networks are implemented as fully connected layers. The process of generating the predicted title text label is described in detail in the following specific implementation.
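A minimal sketch of the three prediction networks, each implemented as a fully connected layer as stated above; the hidden size and the number of order classes are assumptions:

```python
import torch.nn as nn

hidden = 768
max_title_len = 10  # assumed maximum title length; class 0 means "other"

# Task one (binary): does the image contain a user-edited title?
has_title_head = nn.Linear(hidden, 2)
# Task two (binary): does this OCR text constitute part of the title?
is_title_head = nn.Linear(hidden, 2)
# Task three (multi-class): position of this OCR text within the title.
order_head = nn.Linear(hidden, max_title_len + 1)
```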
In a specific implementation manner of the embodiment of the present disclosure, the prediction layer includes: a first title prediction layer, a second title prediction layer, and a text order prediction layer, where step 205 may include:
substep A1: and calling the first title prediction layer to perform prediction processing on the image fusion characteristics, and generating a first prediction label for predicting whether the image contains an editing title text.
In an embodiment of the present disclosure, the prediction layer may include: a first title prediction layer, a second title prediction layer, and a text order prediction layer.
After the image fusion feature is acquired, the first title prediction layer may be called to perform prediction processing on the image fusion feature to generate a first prediction tag used for predicting whether the edited title text is contained in the image, that is, the first prediction tag may be used for indicating whether the text edited by the user is contained in the target image.
Substep A2: calling the second title prediction layer to perform prediction processing on the image fusion features, and generating a second prediction label for predicting whether the text constitutes a user-edited title.
After the image fusion feature is acquired, a second title prediction layer may be called to perform prediction processing on the image fusion feature to generate a second prediction tag for predicting whether the text constitutes a title edited by the user, that is, the second prediction tag may be used to indicate whether the text in the target image is a title text edited by the user.
Substep A3: calling the text order prediction layer to perform prediction processing on the image fusion features, and generating a third prediction label for predicting the order of the text within the user-edited title text.
After the image fusion features are acquired, the text order prediction layer may be called to perform prediction processing on the image fusion features, so as to generate a third prediction label for predicting the order of the text within the user-edited title text; that is, the third prediction label may be used to indicate the position, within the user-edited title text, of each text extracted from the target image.
Substep A4: and determining the predicted title text label according to the first predicted label, the second predicted label and the third predicted label.
After the first prediction tag, the second prediction tag and the third prediction tag are obtained through the above steps, the prediction title text tag corresponding to the target image can be determined according to the first prediction tag, the second prediction tag and the third prediction tag, that is, the first prediction tag, the second prediction tag and the third prediction tag are commonly used as the prediction title text tag.
After the prediction layer is called to perform prediction processing on the image fusion features to obtain a predicted title text label, step 206 is performed.
Step 206: and calculating to obtain a loss value corresponding to the to-be-trained title text extraction model based on the initial title text label and the predicted title text label.
After the prediction layer is called to perform prediction processing on the image fusion features to obtain a predicted title text label, a loss value corresponding to a title text extraction model to be trained can be obtained through calculation based on the initial title text label and the predicted title text label.
In addition, task two (judging whether the OCR texts constitute the user-edited title) and task three (judging the order of the OCR texts within the user-edited title) are correlated to a certain extent: if task two judges that an OCR text does not constitute part of the user-edited title, that OCR text has no corresponding position in the final edited title, i.e., task three should classify it as "other". Therefore, to strengthen this correlation, the judgment result of task two is introduced into the loss function of task three.
Specifically, during the initial stage of training, task three computes the cross-entropy loss over all OCR multimodal features that constitute the final user-edited title (i.e., the data whose true label in task three is not 0):

L = -∑_i p_i log q_i

where p_i is the true label and q_i is the label predicted by the model. After training reaches a certain stage, the judgment result of task two is introduced: in each batch, x samples are randomly selected and their true labels are replaced by the prediction results of the task-two model; the task-three loss is then computed over the data whose label is not 0:

L = -∑_j r_j log q_j

where r_j is the prediction label of the task-two model.
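One possible reading of this replacement scheme is sketched below; it assumes task three's labels are order indices with 0 meaning "other", and that for the x randomly selected samples per batch task two's binary prediction decides whether a sample keeps its order label or is relabeled as 0:

```python
import torch
import torch.nn.functional as F

def task3_loss(order_logits, order_labels, task2_pred, x, warmed_up):
    """Cross-entropy over OCR items whose (possibly replaced) label is
    not 0. After the initial training stage, x randomly selected items
    per batch use task two's prediction instead of the true label."""
    labels = order_labels.clone()
    if warmed_up:
        idx = torch.randperm(labels.size(0))[:x]
        # Items task two predicts as non-title are relabeled 0 ("other")
        # and therefore drop out of the loss.
        labels[idx] = torch.where(task2_pred[idx] == 1,
                                  labels[idx],
                                  torch.zeros_like(labels[idx]))
    mask = labels != 0
    if not mask.any():
        return order_logits.new_zeros(())
    return F.cross_entropy(order_logits[mask], labels[mask])
```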
After the loss value of the title text extraction model to be trained is calculated, step 207 is executed.
Step 207: and under the condition that the loss value is within a preset range, taking the trained to-be-trained title text extraction model as the target title text extraction model.
After the loss value of the to-be-trained title text extraction model is obtained through calculation, whether the loss value is within a preset range can be judged; if the loss value is within the preset range, the trained to-be-trained title text extraction model can be used as the target title text extraction model.
When the loss value is outside the preset range, other sample images can be acquired to continue training the model as in steps 203 to 206 above, until the loss value falls within the preset range.
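Steps 203 to 207 then amount to a training loop of the following shape; the optimizer, learning rate, loss threshold, and the model's forward/compute_loss interface are illustrative assumptions, not part of this disclosure:

```python
import torch

def train_title_model(model, loader, loss_threshold=0.05, max_epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for text_feats, visual_feats, labels in loader:
            pred_labels = model(text_feats, visual_feats)    # steps 204-205
            loss = model.compute_loss(pred_labels, labels)   # step 206
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Step 207: once the loss is within the preset range, the trained
        # model is taken as the target title text extraction model.
        if epoch_loss / len(loader) <= loss_threshold:
            break
    return model
```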
Step 208: and acquiring text features and image visual features corresponding to the target image.
The target image is an image from which the user-edited title text is to be extracted; in this example, the target image may be a dish image, an item image, or the like.
The text features refer to features formed by the text within the target image. In this example, the text features may include the image text recognized by a character recognition technology, together with features such as the text box and the position corresponding to the image text.
The image visual features refer to features formed by the regions of the target image other than the image text.
When extracting the user-edited image text in a target image, the text features and the image visual features corresponding to the target image may be acquired, as described in detail in the following specific implementation.
In a specific implementation manner of the embodiment of the present disclosure, the step 208 may include:
substep B1: and identifying the image text in the target image based on a character identification technology.
In this embodiment, after the target image is obtained, a character recognition technology, e.g., OCR (Optical Character Recognition), may be used to recognize the image text in the target image.
After the image text in the target image is identified based on character recognition techniques, sub-step B2 is performed.
Substep B2: and cutting the target image based on the region of the image text in the target image to generate a text region image and a non-text region image.
After the image text in the target image is identified, the region of the image text within the target image may be obtained, and the target image may be cropped based on the region to obtain a text region image (i.e., a region image containing text) and a non-text region image (i.e., a region image containing no text).
After generating the text region image and the non-text region image by cropping the target image based on the region of the image text within the target image, sub-step B3 is performed.
Substep B3: and respectively carrying out feature extraction processing on the text region image and the non-text region image based on a pre-training feature extraction model to obtain the text features and the image visual features.
After the target image is cropped based on the region of the image text in the target image to generate a text region image and a non-text region image, feature extraction processing can be respectively performed on the text region image and the non-text region image based on the pre-trained feature extraction model to obtain text features and image visual features.
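Sub-steps B1 to B3 could be realized along the following lines; the OCR engine is left abstract, ResNet-50 merely stands in for the pre-trained feature extraction model (which this disclosure does not name), and treating the whole image as the non-text region is a simplification:

```python
import torch
from PIL import Image
from torchvision import models, transforms

def run_ocr(image):
    """B1: character recognition. Plug in any OCR engine that returns
    (text, (x0, y0, x1, y1)) pairs; left abstract here."""
    raise NotImplementedError

# Stand-in for the pre-trained feature extraction model.
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()  # pooled features instead of class logits
backbone.eval()

prep = transforms.Compose([transforms.Resize((224, 224)),
                           transforms.ToTensor()])

def extract_features(path):
    image = Image.open(path).convert("RGB")
    results = run_ocr(image)
    # B2: crop each text region; the whole image stands in for the
    # non-text region here, for simplicity.
    text_crops = [image.crop(box) for _, box in results]
    # B3: feature extraction on the text regions and the non-text region.
    with torch.no_grad():
        text_features = torch.stack(
            [backbone(prep(c).unsqueeze(0)).squeeze(0) for c in text_crops])
        image_visual_feature = backbone(prep(image).unsqueeze(0)).squeeze(0)
    return text_features, image_visual_feature
```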
After acquiring the text feature and the image visual feature corresponding to the target image, step 209 is performed.
Step 209: and inputting the text features and the image visual features into a target title text extraction model.
The target title text extraction model refers to a model trained in advance for extracting text from an image.
After the text features and the image visual features corresponding to the target image are acquired, the acquired text features and the image visual features can be input into the target title text extraction model.
After the text features and the image visual features corresponding to the target image are input into the target title text extraction model, step 210 is performed.
Step 210: and processing the text features and the image visual features based on the target title text extraction model, and determining a target image title text corresponding to the target image.
After the text features and the image visual features corresponding to the target image are input to the target title text extraction model, the text features and the image visual features may be processed based on the target title text extraction model to determine the target image title text corresponding to the target image.
The process for determining the title text of the target image can be described in detail in conjunction with the following specific implementation.
In a specific implementation manner of the embodiment of the present disclosure, the step 210 may include:
substep C1: and calling the coding layer to perform fusion processing on the text features and the image visual features to generate target image fusion features.
In this embodiment, the target title text extraction model may include: an encoding layer and a prediction layer.
After the text features and the image visual features corresponding to the target image are obtained, the coding layer can be called to perform fusion processing on the text features and the image visual features so as to generate target image fusion features.
After the target image fusion feature is generated, sub-step C2 is performed.
Substep C2: and calling the prediction layer to perform title prediction processing on the target image fusion characteristics to obtain a target image title text corresponding to the target image.
After the target image fusion features are generated, the prediction layer may be called to perform title prediction processing on the target image fusion features to obtain the target image title text corresponding to the target image. This is similar to the process of the foregoing sub-step A1 to sub-step A2, and details are not repeated here.
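Assembling the target image title text from the three prediction heads could then look like the following sketch; the argmax decision rules and the concatenation order are assumptions consistent with the foregoing sub-steps A1 to A4:

```python
import torch

def decode_title(has_title_logits, is_title_logits, order_logits, ocr_texts):
    """Return the user-edited title text, or None if the image has none."""
    if has_title_logits.argmax().item() == 0:   # task one: no edited title
        return None
    is_title = is_title_logits.argmax(dim=-1)   # task two: title membership
    order = order_logits.argmax(dim=-1)         # task three: position (0 = other)
    picked = [(int(order[i]), ocr_texts[i])
              for i in range(len(ocr_texts))
              if int(is_title[i]) == 1 and int(order[i]) > 0]
    picked.sort(key=lambda t: t[0])             # order within the title
    return "".join(text for _, text in picked)
```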
This embodiment fully utilizes cross-modal information interaction between image features of different granularities and text features to determine whether an image carries a user-edited title and to accurately extract that title, improving the quality and relevance of image selection to a certain extent.
According to the image title text determination method provided by the embodiment of the disclosure, the text features and the image visual features corresponding to the target image are acquired, the text features and the image visual features are input into the target title text extraction model, the text features and the image visual features are processed based on the target title text extraction model, and the target image title text corresponding to the target image is determined. By fully exploiting cross-modal information interaction between image features of different granularities and text features, the embodiment determines whether an image carries a user-edited title and accurately extracts and orders that title, improving the quality and relevance of image selection to a certain extent.
EXAMPLE III
Referring to fig. 3, which shows a schematic structural diagram of an image title text determination apparatus provided in an embodiment of the present disclosure, as shown in fig. 3, the image title text determination apparatus 300 may include the following modules:
an image feature obtaining module 310, configured to obtain a text feature and an image visual feature corresponding to a target image;
an image feature input module 320, configured to input the text feature and the image visual feature to a target headline text extraction model;
and the image title determining module 330 is configured to process the text features and the image visual features based on the target title text extraction model, and determine a target image title text corresponding to the target image.
The image title text determination device provided by the embodiment of the disclosure acquires the text features and the image visual features corresponding to the target image, inputs the text features and the image visual features into the target title text extraction model, processes the text features and the image visual features based on the target title text extraction model, and determines the target image title text corresponding to the target image. By fully exploiting cross-modal information interaction between image features of different granularities and text features, the embodiment determines whether an image carries a user-edited title and accurately extracts and orders that title, improving the quality and relevance of image selection to a certain extent.
Example four
Referring to fig. 4, which shows a schematic structural diagram of another image title text determination apparatus provided in an embodiment of the present disclosure, as shown in fig. 4, the image title text determination apparatus 400 may include the following modules:
a sample image acquisition module 410 for acquiring a sample image; the sample image is an image containing text, and the sample image corresponds to an initial title text label;
a sample image feature obtaining module 420, configured to obtain a sample text feature and a sample image visual feature corresponding to the sample image;
a sample image characteristic input module 430, configured to input the sample text characteristic and the sample image visual characteristic to a to-be-trained title text extraction model; the title text extraction model to be trained comprises: an encoding layer and a prediction layer;
the image fusion feature generation module 440 is configured to invoke the coding layer to perform fusion processing on the sample text feature and the sample image visual feature, so as to generate an image fusion feature;
a predicted title tag obtaining module 450, configured to invoke the prediction layer to perform prediction processing on the image fusion feature, so as to obtain a predicted title text tag;
a loss value calculation module 460, configured to calculate a loss value corresponding to the to-be-trained heading text extraction model based on the initial heading text label and the predicted heading text label;
a target heading extraction model obtaining module 470, configured to take the trained heading text extraction model to be trained as the target heading text extraction model when the loss value is within the preset range;
an image feature obtaining module 480, configured to obtain text features and image visual features corresponding to a target image;
an image feature input module 490, configured to input the text feature and the image visual feature to a target headline text extraction model;
an image title determining module 4100, configured to process the text feature and the image visual feature based on the target title text extraction model, and determine a target image title text corresponding to the target image.
Optionally, the image feature obtaining module includes:
an image text recognition unit for recognizing an image text in the target image based on a character recognition technique;
a region image generating unit, configured to crop the target image based on a region of the image text within the target image, and generate a text region image and a non-text region image;
and the image feature acquisition unit is used for respectively carrying out feature extraction processing on the text region image and the non-text region image based on a pre-training feature extraction model to obtain the text features and the image visual features.
Optionally, the prediction layer comprises: a first title prediction layer, a second title prediction layer and a text order prediction layer,
the predicted title tag obtaining module includes:
a first prediction tag generation unit, configured to invoke the first title prediction layer to perform prediction processing on the image fusion feature, and generate a first prediction tag for predicting whether the image contains an edited title text;
the second prediction label generating unit is used for calling the second title prediction layer to perform prediction processing on the image fusion features and generating a second prediction label for predicting whether a text constitutes a user-edited title;
a third prediction tag generation unit, configured to invoke the text order prediction layer to perform prediction processing on the image fusion feature, and generate a third prediction tag for predicting the order of a text within the user-edited title text;
a predicted title tag determination unit configured to determine the predicted title text tag according to the first predicted tag, the second predicted tag, and the third predicted tag.
Optionally, the target title text extraction model includes: an encoding layer and a prediction layer,
the image title determination module includes:
the target fusion feature generation unit is used for calling the coding layer to perform fusion processing on the text features and the image visual features to generate target image fusion features;
and the target image title acquisition unit is used for calling the prediction layer to perform title prediction processing on the target image fusion characteristics to obtain a target image title text corresponding to the target image.
The image title text determination device provided by the embodiment of the disclosure acquires the text features and the image visual features corresponding to the target image, inputs the text features and the image visual features into the target title text extraction model, processes the text features and the image visual features based on the target title text extraction model, and determines the target image title text corresponding to the target image. By fully exploiting cross-modal information interaction between image features of different granularities and text features, the embodiment determines whether an image carries a user-edited title and accurately extracts and orders that title, improving the quality and relevance of image selection to a certain extent.
An embodiment of the present disclosure also provides an electronic device, including: a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the image title text determination method of the foregoing embodiments when executing the program.
Embodiments of the present disclosure also provide a readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the image title text determination method of the foregoing embodiments.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present disclosure are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the embodiments of the present disclosure as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the embodiments of the present disclosure.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the embodiments of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, claimed embodiments of the disclosure require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of an embodiment of this disclosure.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
The various component embodiments of the disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be understood by those skilled in the art that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of an image title text determination device according to an embodiment of the present disclosure. Embodiments of the present disclosure may also be implemented as an apparatus or device program for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present disclosure may be stored on a computer-readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, or provided on a carrier signal, or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit embodiments of the disclosure, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the disclosure may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above description is only for the purpose of illustrating the preferred embodiments of the present disclosure and is not to be construed as limiting the embodiments of the present disclosure, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the embodiments of the present disclosure are intended to be included within the scope of the embodiments of the present disclosure.
The above description is only a specific implementation of the embodiments of the present disclosure, but the scope of the embodiments of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present disclosure, and all the changes or substitutions should be covered by the scope of the embodiments of the present disclosure. Therefore, the protection scope of the embodiments of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. An image title text determination method, comprising:
acquiring text features and image visual features corresponding to a target image;
inputting the text features and the image visual features into a target title text extraction model;
and processing the text features and the image visual features based on the target title text extraction model, and determining a target image title text corresponding to the target image.
2. The method of claim 1, wherein the acquiring text features and image visual features corresponding to a target image comprises:
identifying image text in the target image based on a character recognition technique;
cropping the target image based on the region occupied by the image text within the target image, to generate a text region image and a non-text region image;
and performing feature extraction on the text region image and the non-text region image respectively, based on a pre-trained feature extraction model, to obtain the text features and the image visual features.
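As a concrete illustration of claim 2, the following is a minimal sketch of the feature-acquisition pipeline in Python. The OCR interface, the masking-based cropping policy, and all names used here are assumptions for illustration; the claim does not prescribe specific models or libraries.

```python
# Minimal sketch of claim 2, assuming an OCR step that returns text boxes.
# All choices below are illustrative assumptions, not the patented design.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class TextRegion:
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) pixel coordinates
    text: str                       # recognized image text


def split_regions(image: np.ndarray, regions: List[TextRegion]):
    """Crop the target image into text-region crops and one non-text image."""
    text_crops = [image[r.box[1]:r.box[3], r.box[0]:r.box[2]] for r in regions]
    non_text = image.copy()
    for r in regions:
        # Mask out text areas so the remainder approximates a non-text image.
        non_text[r.box[1]:r.box[3], r.box[0]:r.box[2]] = 0
    return text_crops, non_text


def extract_features(text_crops, non_text, text_encoder, visual_encoder):
    """Apply pre-trained extractors to each region type, per claim 2."""
    text_features = [text_encoder(crop) for crop in text_crops]
    visual_features = visual_encoder(non_text)
    return text_features, visual_features
```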
3. The method of claim 1, further comprising, prior to the inputting of the text features and the image visual features into a target title text extraction model:
acquiring a sample image, wherein the sample image is an image containing text and corresponds to an initial title text label;
acquiring sample text features and sample image visual features corresponding to the sample image;
inputting the sample text features and the sample image visual features into a title text extraction model to be trained, wherein the title text extraction model to be trained comprises an encoding layer and a prediction layer;
calling the encoding layer to fuse the sample text features and the sample image visual features to generate image fusion features;
calling the prediction layer to perform prediction processing on the image fusion features to obtain a predicted title text label;
calculating a loss value for the title text extraction model to be trained based on the initial title text label and the predicted title text label;
and, when the loss value falls within a preset range, taking the trained title text extraction model as the target title text extraction model.
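The training procedure of claim 3 could be sketched as follows in PyTorch. The encoder architecture, the loss function, and the stopping rule are assumptions chosen for illustration; the claim only requires an encoding layer, a prediction layer, and a loss computed from the two labels.

```python
# Hedged sketch of the training step in claim 3; the architecture and
# loss below are illustrative assumptions, not the patented implementation.
import torch
import torch.nn as nn


class TitleTextExtractor(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        # "Encoding layer": fuses text features with image visual features.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # "Prediction layer": maps each fused feature to a title-text label.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor):
        # Both inputs: (batch, seq_len, dim); concatenate along the sequence.
        fused = self.encoder(torch.cat([text_feats, visual_feats], dim=1))
        return self.head(fused)


def train_step(model, optimizer, criterion, text_feats, visual_feats, labels):
    """One optimization step against the initial title text labels."""
    optimizer.zero_grad()
    logits = model(text_feats, visual_feats)
    loss = criterion(logits.flatten(0, 1), labels.flatten())
    loss.backward()
    optimizer.step()
    return loss.item()  # training stops once this falls within a preset range
```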
4. The method of claim 3, wherein the prediction layer comprises: a first title prediction layer, a second title prediction layer and a text order prediction layer,
and the calling the prediction layer to perform prediction processing on the image fusion features to obtain a predicted title text label comprises:
calling the first title prediction layer to perform prediction processing on the image fusion features, generating a first prediction label for predicting whether the image contains an edited title text;
calling the second title prediction layer to perform prediction processing on the image fusion features, generating a second prediction label for predicting whether each text forms part of the user-edited title;
calling the text order prediction layer to perform prediction processing on the image fusion features, generating a third prediction label for predicting the order of the text within the user-edited title text;
and determining the predicted title text label according to the first prediction label, the second prediction label and the third prediction label.
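A minimal sketch of the three-head prediction layer of claim 4 follows, assuming an image-level head for title presence, a token-level head for title membership, and a token-level head for title order. The head shapes and the mean-pooling step are illustrative assumptions.

```python
# Sketch of the three prediction heads of claim 4; shapes are assumptions.
import torch
import torch.nn as nn


class PredictionLayer(nn.Module):
    def __init__(self, dim: int = 256, max_order: int = 32):
        super().__init__()
        self.has_title = nn.Linear(dim, 2)      # first: image has an edited title?
        self.in_title = nn.Linear(dim, 2)       # second: token belongs to the title?
        self.order = nn.Linear(dim, max_order)  # third: token position in the title

    def forward(self, fused: torch.Tensor):  # fused: (batch, seq_len, dim)
        pooled = fused.mean(dim=1)  # image-level summary for the first head
        return {
            "first": self.has_title(pooled),  # (batch, 2)
            "second": self.in_title(fused),   # (batch, seq_len, 2)
            "third": self.order(fused),       # (batch, seq_len, max_order)
        }
```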
5. The method of claim 1, wherein the target title text extraction model comprises an encoding layer and a prediction layer,
and the processing the text features and the image visual features based on the target title text extraction model to determine a target image title text corresponding to the target image comprises:
calling the encoding layer to fuse the text features and the image visual features to generate target image fusion features;
and calling the prediction layer to perform title prediction processing on the target image fusion features to obtain the target image title text corresponding to the target image.
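Combining the sketches above, inference per claim 5 might decode the three heads as follows. The decoding rule (keep tokens flagged as title text and sort them by predicted order) is an assumption; the claims do not fix a decoding strategy.

```python
# Illustrative inference for claim 5, reusing the sketched modules above.
import torch


@torch.no_grad()
def extract_title(model, prediction_layer, text_feats, visual_feats, tokens):
    """Return the decoded title text, or None if no edited title is predicted."""
    fused = model.encoder(torch.cat([text_feats, visual_feats], dim=1))
    out = prediction_layer(fused)
    if out["first"].argmax(dim=-1).item() == 0:
        return None  # first head: image contains no edited title text
    keep = out["second"][0, : len(tokens)].argmax(dim=-1)  # 1 = in the title
    ranked = [(out["third"][0, i].argmax().item(), tokens[i])
              for i in range(len(tokens)) if keep[i] == 1]
    return "".join(tok for _, tok in sorted(ranked))  # order by the third head
```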
6. An image title text determination apparatus, comprising:
an image feature acquisition module, configured to acquire text features and image visual features corresponding to a target image;
an image feature input module, configured to input the text features and the image visual features into a target title text extraction model;
and an image title determination module, configured to process the text features and the image visual features based on the target title text extraction model and determine a target image title text corresponding to the target image.
7. The apparatus of claim 6, wherein the image feature acquisition module comprises:
an image text recognition unit, configured to identify image text in the target image based on a character recognition technique;
a region image generation unit, configured to crop the target image based on the region occupied by the image text within the target image, to generate a text region image and a non-text region image;
and an image feature acquisition unit, configured to perform feature extraction on the text region image and the non-text region image respectively, based on a pre-trained feature extraction model, to obtain the text features and the image visual features.
8. The apparatus of claim 6, further comprising:
a sample image acquisition module, configured to acquire a sample image, wherein the sample image is an image containing text and corresponds to an initial title text label;
a sample image feature acquisition module, configured to acquire sample text features and sample image visual features corresponding to the sample image;
a sample image feature input module, configured to input the sample text features and the sample image visual features into a title text extraction model to be trained, wherein the title text extraction model to be trained comprises an encoding layer and a prediction layer;
an image fusion feature generation module, configured to call the encoding layer to fuse the sample text features and the sample image visual features to generate image fusion features;
a predicted title label acquisition module, configured to call the prediction layer to perform prediction processing on the image fusion features to obtain a predicted title text label;
a loss value calculation module, configured to calculate a loss value for the title text extraction model to be trained based on the initial title text label and the predicted title text label;
and a target title extraction model acquisition module, configured to take the trained title text extraction model as the target title text extraction model when the loss value falls within a preset range.
9. The apparatus of claim 8, wherein the prediction layer comprises: a first title prediction layer, a second title prediction layer and a text order prediction layer,
and the predicted title label acquisition module comprises:
a first prediction label generation unit, configured to call the first title prediction layer to perform prediction processing on the image fusion features, generating a first prediction label for predicting whether the image contains an edited title text;
a second prediction label generation unit, configured to call the second title prediction layer to perform prediction processing on the image fusion features, generating a second prediction label for predicting whether each text forms part of the user-edited title;
a third prediction label generation unit, configured to call the text order prediction layer to perform prediction processing on the image fusion features, generating a third prediction label for predicting the order of the text within the user-edited title text;
and a predicted title label determination unit, configured to determine the predicted title text label according to the first prediction label, the second prediction label and the third prediction label.
10. The apparatus of claim 6, wherein the target title text extraction model comprises an encoding layer and a prediction layer,
and the image title determination module comprises:
a target fusion feature generation unit, configured to call the encoding layer to fuse the text features and the image visual features to generate target image fusion features;
and a target image title acquisition unit, configured to call the prediction layer to perform title prediction processing on the target image fusion features to obtain the target image title text corresponding to the target image.
11. An electronic device, comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the image title text determination method of any one of claims 1 to 5.
12. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image title text determination method of any one of claims 1 to 5.
CN202210467945.9A 2022-04-29 2022-04-29 Image title text determination method and device, electronic equipment and storage medium Pending CN114842488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210467945.9A CN114842488A (en) 2022-04-29 2022-04-29 Image title text determination method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210467945.9A CN114842488A (en) 2022-04-29 2022-04-29 Image title text determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114842488A true CN114842488A (en) 2022-08-02

Family

ID=82568591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210467945.9A Pending CN114842488A (en) 2022-04-29 2022-04-29 Image title text determination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114842488A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601772A (en) * 2022-12-15 2023-01-13 Nanjing University of Posts and Telecommunications (CN) Multi-mode learning-based aesthetic quality evaluation model and method

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
CN110119786B (en) Text topic classification method and device
CN110364146B (en) Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
CN110737783A (en) method, device and computing equipment for recommending multimedia content
EP2587826A1 (en) Extraction and association method and system for objects of interest in video
CN109271542A (en) Cover determines method, apparatus, equipment and readable storage medium storing program for executing
CN111444326A (en) Text data processing method, device, equipment and storage medium
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN111814817A (en) Video classification method and device, storage medium and electronic equipment
CN113704507B (en) Data processing method, computer device and readable storage medium
CN111400548B (en) Recommendation method and device based on deep learning and Markov chain
CN113642536B (en) Data processing method, computer device and readable storage medium
CN113255625B (en) Video detection method and device, electronic equipment and storage medium
CN110826609A (en) Double-flow feature fusion image identification method based on reinforcement learning
CN115564469A (en) Advertisement creative selection and model training method, device, equipment and storage medium
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN111651674B (en) Bidirectional searching method and device and electronic equipment
CN114596566A (en) Text recognition method and related device
CN114842488A (en) Image title text determination method and device, electronic equipment and storage medium
CN113221718A (en) Formula identification method and device, storage medium and electronic equipment
CN113378919A (en) Image description generation method for fusing visual sense and enhancing multilayer global features
CN112052352A (en) Video sequencing method, device, server and storage medium
CN113569613A (en) Image processing method, image processing apparatus, image processing device, and storage medium
CN112784156A (en) Search feedback method, system, device and storage medium based on intention recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination