CN117934974A - Scene text task processing method, system, equipment and storage medium - Google Patents

Scene text task processing method, system, equipment and storage medium

Info

Publication number
CN117934974A
CN117934974A
Authority
CN
China
Prior art keywords
text
layer
image
scene
style
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410326155.8A
Other languages
Chinese (zh)
Other versions
CN117934974B (en)
Inventor
张勇东
张博强
谢洪涛
王裕鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202410326155.8A priority Critical patent/CN117934974B/en
Publication of CN117934974A publication Critical patent/CN117934974A/en
Application granted granted Critical
Publication of CN117934974B publication Critical patent/CN117934974B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene text task processing method, system, device and storage medium. In the pre-training stage, the two types of features commonly coexisting in scene text images (style features and content features) are decoupled: through decoupling characterization learning, the model is guided to decompose scene text images into content features and style features, so that various scene text tasks can be completed better and a more discriminative representation is obtained. After training, different features can be selected when completing different tasks, enabling the model to better complete different scene text tasks. Extensive experiments show that the method outperforms existing methods and achieves state-of-the-art performance on scene text recognition, editing and erasing tasks.

Description

Scene text task processing method, system, equipment and storage medium
Technical Field
The present invention relates to the field of scene text task processing technologies, and in particular, to a scene text task processing method, system, device, and storage medium.
Background
As an important information carrier, text is widely present in natural scenes. Scene text is an important topic in scene understanding and perception, and the recognition, editing and erasing of scene text are key tasks in this field. These tasks are widely applied in areas such as human-computer interaction and autonomous driving.
Conventional methods can generally implement only a single task; for multi-task implementation, characterization learning is typically used. These methods use characterization learning to improve the quality of image features and thereby improve model performance on different downstream tasks. In the scene text field, such methods typically pre-train the backbone network with masked image modeling and feature contrastive learning, and the pre-trained backbone is then used to fine-tune a decoder for a specific task. Although this approach achieves good performance, it is clearly suboptimal to use the same features for different scene text tasks without considering the specificity of text images.
In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a scene text task processing method, a system, equipment and a storage medium, which are used for guiding a model to decompose content characteristics and style characteristics in a scene text image through decoupling characterization learning so as to better complete various scene text tasks.
The invention aims at realizing the following technical scheme:
A scene text task processing method comprises the following steps:
Constructing a scene text task processing model, comprising: a visual encoder, a decoupling network and a multi-tasking decoder;
Training the scene text task processing model, wherein the training process comprises pre-training and fine-tuning, and during pre-training, a decoupling training data set containing a plurality of text image pairs is obtained, wherein each text image pair contains two text images with the same background and font style and different text contents; the visual characteristics of each text image in each text image pair are respectively extracted by using a visual encoder, and are decoupled into content characteristics and style characteristics through a decoupling network, and the alignment loss is calculated by using the style characteristics of the two text images in the same text image pair; respectively inputting the content characteristics of each text image into a multitask decoder to obtain text recognition results, and calculating recognition loss by using the text recognition results; inputting style characteristics and content characteristics of one text image in the text image pair and text prompts in the other text image into a multitask decoder to obtain a reconstruction result of reconstructing the background image and the other text image, and calculating corresponding reconstruction loss by using the reconstruction result of reconstructing the background image and the other text image; pre-training the scene text task processing model by combining all the calculated losses;
And executing the scene text task by using the trained scene text task processing model.
A scene text task processing system, comprising:
The model construction unit is used for constructing a scene text task processing model and comprises the following steps: a visual encoder, a decoupling network and a multi-tasking decoder;
The model training unit is used for training the scene text task processing model, the training process comprises pre-training and fine-tuning, and during pre-training, a decoupling training data set containing a plurality of text image pairs is obtained, and two text images contained in each text image pair have the same background and font style and different text contents; the visual characteristics of each text image in each text image pair are respectively extracted by using a visual encoder, and are decoupled into content characteristics and style characteristics through a decoupling network, and the alignment loss is calculated by using the style characteristics of the two text images in the same text image pair; respectively inputting the content characteristics of each text image into a multitask decoder to obtain text recognition results, and calculating recognition loss by using the text recognition results; inputting style characteristics and content characteristics of one text image in the text image pair and text prompts in the other image into a multi-task decoder to obtain a reconstruction result of the other image, and calculating reconstruction loss by using the reconstruction result; pre-training the scene text task processing model by combining all the calculated losses;
And the task processing unit is used for executing the scene text task by using the trained scene text task processing model.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
Wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, the two types of features commonly coexisting in scene text images (style features and content features) are decoupled, and different features are then selected when completing different tasks, thereby achieving the goal of better completing multiple downstream tasks (i.e., scene text tasks).
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a scene text task processing method according to an embodiment of the present invention;
FIG. 2 is a frame diagram of a scene text task processing model provided by an embodiment of the invention;
FIG. 3 is an exemplary diagram of the synthesized dataset for decoupled characterization learning provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-tasking decoder according to an embodiment of the present invention;
FIG. 5 is a schematic view of the effect of the present invention on scene text editing provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of the effect of the present invention on scene text erasure provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a scene text task processing system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
The terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The method, system, equipment and storage medium for processing the scene text task are described in detail below. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples of the present invention, conventional conditions in the art or those suggested by the manufacturer are followed.
Example 1
The embodiment of the invention provides a scene text task processing method, which mainly comprises the following steps as shown in fig. 1:
And step1, constructing a scene text task processing model.
In the embodiment of the invention, the scene text task processing model mainly comprises: visual encoder, decoupling network and multi-tasking decoder.
And step2, training the scene text task processing model.
In the embodiment of the invention, the training process comprises a pre-training part and a fine-tuning part, and during the pre-training, a decoupling training data set containing a plurality of text image pairs is obtained, wherein two text images contained in each text image pair have the same background and font style and different text contents; the visual characteristics of each text image in each text image pair are respectively extracted by using a visual encoder, and are decoupled into content characteristics and style characteristics through a decoupling network, and the alignment loss is calculated by using the style characteristics of the two text images in the same text image pair; respectively inputting the content characteristics of each text image into a multitask decoder to obtain text recognition results, and calculating recognition loss by using the text recognition results; inputting style characteristics and content characteristics of one text image in the text image pair and text prompts in the other text image into a multitask decoder to obtain a reconstruction result of reconstructing the background image and the other text image, and calculating corresponding reconstruction loss by using the reconstruction result of reconstructing the background image and the other text image; the scene text task processing model is pre-trained in combination with all the losses calculated.
And step 3, executing the scene text task by using the trained scene text task processing model.
The trained scene text task processing model can be embedded into an intelligent system to implement various scene text tasks, including scene text recognition, editing and erasing. It can also serve as a module downstream of a text detection module: given the position coordinates of a text region, it recognizes, edits or erases the text content within that region. In practice, the model can be installed on a server or embedded into an intelligent system as software to meet large-scale background processing demands; an illustrative invocation is sketched below.
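As an illustration only, the following sketch shows how such a system might invoke the model once a text detection module has provided region coordinates; the wrapper methods recognize, erase and edit, as well as the crop-and-resize handling, are hypothetical and not an interface defined by the invention.

# Illustrative deployment sketch: a text detector supplies region coordinates,
# the region is cropped and resized to the model's input resolution, and a
# task-specific decoder is selected. The wrapper method names are hypothetical.
from PIL import Image

def process_region(model, image_path, box, task="recognize", new_text=None):
    x1, y1, x2, y2 = box                              # coordinates from a text detection module
    crop = Image.open(image_path).crop((x1, y1, x2, y2)).resize((128, 32))
    if task == "recognize":
        return model.recognize(crop)                  # returns the text string in the region
    if task == "erase":
        return model.erase(crop)                      # returns the region with the text removed
    if task == "edit":
        return model.edit(crop, new_text)             # renders new_text in the original style
    raise ValueError(f"unknown task: {task}")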
Preferably, since the pre-training is mainly oriented toward the scene text editing task, the scene text erasing task and the scene text recognition task can in principle already be performed after pre-training; however, to improve performance on these tasks, when the model is applied to scene text erasing and scene text recognition it needs to be fine-tuned with the corresponding task datasets.
Compared with the prior art, the scheme provided by the embodiment of the invention starts from the characteristics that distinguish text images from conventional scene images and decouples content features and style features in the visual space during the pre-training stage, so that the model can complete different scene text tasks while obtaining more discriminative representations. Extensive experiments show that the method outperforms existing methods and achieves state-of-the-art performance on scene text recognition, editing and erasing tasks.
In order to more clearly demonstrate the technical scheme and the technical effects provided by the invention, the method provided by the embodiment of the invention is described in detail below by using specific embodiments.
1. The scheme principle is introduced.
Unlike general scene images, a cropped scene text image usually contains a region of high information density together with a diverse background. These two kinds of characteristics can be summarized as content features and style features. Style features include the background and font style (e.g., font, color, gradient, etc.), while content features include the text content and texture details. For the scene text recognition task, content features are the desired features, whereas style features act as noise that hinders accurate text recognition. For the scene text editing task, the whole process can be divided into two stages: text erasing and text rendering. The text erasing stage needs to erase the original text and reconstruct the background image; this process requires content features for locating the strokes and style features for reconstructing the background. The text rendering stage relies on content features to generate the new text strokes and on style features to define the font. Furthermore, scene text erasing can also be a separate task. Thus, different downstream tasks require different information, and features unrelated to a particular task may hinder its completion. Decomposing content features and style features can therefore benefit various scene text tasks. Accordingly, the present invention guides the model, through decoupling characterization learning, to decompose the content features and style features of scene text images so as to complete the various tasks.
As shown in fig. 2, a framework diagram of a scene text task processing model is shown, the scene text task processing model is a decoupling representation learning model, and the learning process (i.e., training process) includes: a pre-training phase and a fine-tuning phase.
In the pre-training stage, a large-scale set of text image pairs (e.g., 4 million pairs) is first synthesized. The two images of each pair have the same background and font style but different content, and both images of a pair are input into the scene text task processing model simultaneously. First, the text image pair is input to the visual encoder to extract visual features, which are separated by the decoupling network into two parts representing content features and style features, respectively. To achieve the decoupling, task-specific loss functions are employed to help separate the two parts. Specifically, the scene text task processing model contains a multi-task decoder that is split into two branches: a generation branch and a discrimination branch. The discrimination branch, used for the scene text recognition task, takes only the content features as input; during training, the recognition loss ensures that the content features contain only the content information and detail information of the text in the image, excluding the font style and background information that act as recognition noise. In addition, since the text styles within each image pair are identical, an alignment loss is used to align the style features of the two images in a pair. Finally, the generation branch is used to reconstruct the background image and the other text image of the pair; that is, when applied to the scene text editing task or the text erasing task, the generation branch takes as input the style features and content features of one text image together with the text prompt of the other image. In the example shown in FIG. 2, the rectangular boxes corresponding to the English characters represent the text prompt, where [P] denotes a placeholder that ensures every input has the same length.
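For illustration only, the following is a minimal sketch of how one pre-training step over a decoupled image pair could be organized in PyTorch; the module interfaces (encoder, decouple_net, decoder.recognize, decoder.generate), the batch keys and the equal loss weighting are assumptions, not the exact patented implementation.

# Minimal sketch of one pre-training step over a decoupled image pair
# (module interfaces, loss weighting and tensor shapes are assumptions).
import torch.nn.functional as F

def pretrain_step(encoder, decouple_net, decoder, batch, optimizer):
    img_a, img_b = batch["image_a"], batch["image_b"]        # same style, different text
    labels_a, labels_b = batch["text_a"], batch["text_b"]    # character indices, shape (B, T)
    bg = batch["background"]                                  # separately synthesized background

    # Visual features -> decoupled content / style features
    content_a, style_a = decouple_net(encoder(img_a))
    content_b, style_b = decouple_net(encoder(img_b))

    # Discrimination branch: recognition from content features only
    logits_a = decoder.recognize(content_a)                   # (B, T, num_classes)
    logits_b = decoder.recognize(content_b)
    loss_rec = (F.cross_entropy(logits_a.transpose(1, 2), labels_a)
                + F.cross_entropy(logits_b.transpose(1, 2), labels_b))

    # Alignment loss: the two images of a pair share the same style (L2)
    loss_align = F.mse_loss(style_a, style_b)

    # Generation branch: reconstruct the background and the *other* text image
    bg_pred, edit_pred = decoder.generate(style_a, content_a, prompt=labels_b)
    loss_recon = F.mse_loss(bg_pred, bg) + F.mse_loss(edit_pred, img_b)

    loss = loss_rec + loss_align + loss_recon                 # equal weights assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()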
Because the pre-training process includes the task of reconstructing the other image of a pair, the model can be applied directly to the scene text editing task once pre-training is finished. Since scene text editing usually includes an erasing step, the pre-trained model can in principle also perform the scene text erasing task; and because the pre-training loss includes a recognition loss, the pre-trained model can in principle also be used for the scene text recognition task. However, since the pre-training is not specialized for the erasing and recognition tasks, further fine-tuning on task-specific datasets is required for scene text erasing and scene text recognition.
In the embodiment of the invention, fine-tuning is performed for the specific task. For scene text erasing, the corresponding task dataset is used with the reconstruction loss; specifically, the visual encoder and the background reconstruction head are fine-tuned (see the description below for details). For scene text recognition, the corresponding task dataset is used with the recognition loss; specifically, the visual encoder and the text recognition head are fine-tuned (see the description below for details). The specific procedures for pre-training and fine-tuning with the relevant losses may follow conventional techniques and are not detailed here; a sketch of the parameter selection for fine-tuning is given below.
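A minimal sketch of this parameter selection follows; the attribute names (encoder, decoder.background_head, decoder.recognition_head) are hypothetical and only illustrate which parts are updated for each task.

# Sketch of task-specific fine-tuning: only the visual encoder and the head
# relevant to the task are updated (attribute names are hypothetical).
def finetune_parameters(model, task):
    if task == "erase":                     # fine-tune encoder + background reconstruction head
        trainable = [model.encoder, model.decoder.background_head]
    elif task == "recognize":               # fine-tune encoder + text recognition head
        trainable = [model.encoder, model.decoder.recognition_head]
    else:
        raise ValueError(f"unsupported fine-tuning task: {task}")
    for p in model.parameters():
        p.requires_grad = False
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]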
Finally, the complete model converged in the fine-tuning stage is saved for operation in an actual system. In the inference stage, text images scaled to a specific resolution are input, and different decoders are used to accomplish scene text recognition, erasing and editing.
2. And (5) introducing scheme details.
1. A decoupling training data set is acquired.
The core design of the invention is to decouple the content features and style features in text images through decoupling characterization learning, so as to complete downstream tasks with different requirements, including scene text recognition, editing and erasing. The pre-training framework of the invention requires a decoupled training dataset. To accommodate the subsequent decoupled characterization learning tasks, the invention synthesizes text image pairs that share the same style but contain different content; FIG. 3 provides examples of such text image pairs. A given raw input image is denoted $X \in \mathbb{R}^{W \times H \times C}$, where $\mathbb{R}$ is the set of real numbers, $W$ denotes the width, $H$ the height, and $C$ the number of channels of the image. The dataset can be synthesized with an existing text image synthesis engine, which has the advantages of low acquisition cost, simplicity and easy deployment.
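As a toy illustration of such a pair (the real dataset is produced by a full text image synthesis engine), two strings can be rendered with identical background and font parameters; the Pillow-based rendering below is only a stand-in for that engine.

# Toy illustration of one decoupled pair: both images share the background and
# font style but carry different text; a separate text-free background image is
# also produced as the reconstruction target.
from PIL import Image, ImageDraw, ImageFont

def make_pair(text_a, text_b, size=(128, 32), bg_color=(200, 220, 240)):
    font = ImageFont.load_default()                     # identical font style for both images
    images = []
    for text in (text_a, text_b):
        img = Image.new("RGB", size, bg_color)          # identical background for both images
        ImageDraw.Draw(img).text((4, 8), text, fill=(30, 30, 30), font=font)
        images.append(img)
    background = Image.new("RGB", size, bg_color)       # text-free background target image
    return images[0], images[1], background

img_a, img_b, bg = make_pair("HELLO", "WORLD")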
2. A visual encoder.
In the embodiment of the invention, a ViT (Vision Transformer) model can be used as the visual encoder to map a single image $X$ of an image pair into a latent feature representation space, obtaining the visual features $F$. After the visual encoder extracts the visual features, they are fed into the decoupling network, which divides them into two parts: content features $F_c$ and style features $F_s$. Style features include the background and font style, and content features include the text content and texture details.
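For reference, a minimal sketch of the decoupling stage is given below; the two-projection form of the decoupling network and the feature dimension are assumptions, and the ViT backbone itself is omitted.

# Sketch: the visual features F produced by the ViT backbone are split by a
# decoupling network into content features F_c and style features F_s
# (the two-projection form and the dimension 384 are assumptions).
import torch.nn as nn

class DecoupleNet(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.to_content = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
        self.to_style = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, visual_feat):              # visual_feat: (B, N, dim) ViT token sequence
        content = self.to_content(visual_feat)   # F_c: text content and texture details
        style = self.to_style(visual_feat)       # F_s: background and font style
        return content, style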
3. A multi-task decoder.
To accomplish multiple tasks while helping to decouple the features, a multi-task decoder is designed in the model. The structure of the multi-task decoder is illustrated in FIG. 4. To accomplish the different types of tasks, including generation tasks and discrimination tasks, the multi-task decoder is divided into two branches: a generation branch and a discrimination branch.
(1) Discrimination branch.
Wherein the discrimination branch is used for completing the scene text recognition task, and the input of the discrimination branch is only content features. The discrimination branch includes: a first N-layer attention network and a text recognition head; the content characteristics of each text image are input into the first N-layer attention network, the characteristics are extracted through the first N-layer attention network, and the last layer of output characteristics of the first N-layer attention network obtain a text recognition result through a text recognition head.
By way of example, the text recognition head may be implemented using a single layer of mutual attention, formulated as:

$$\hat{Y} = \mathrm{Linear}\!\left(\mathrm{softmax}\!\left(Q \left(W_K F^{N}\right)^{T}\right) W_V F^{N}\right)$$

where $Q$ is a learnable query vector, $F^{N}$ is the output feature of the $N$-th layer of the discrimination branch, $W_K$ and $W_V$ are linear mappings, $\mathrm{Linear}$ is the linear layer used for classification, and $T$ is the transpose symbol.
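The formula above admits a direct implementation; the sketch below gives one possible reading of the single-layer mutual-attention recognition head, with the maximum character length and the character-class count assumed.

# Sketch of the single-layer mutual-attention text recognition head described
# above; max_len and num_classes are assumed values.
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    def __init__(self, dim=384, max_len=25, num_classes=97):
        super().__init__()
        self.query = nn.Parameter(torch.randn(max_len, dim))   # learnable query vector Q
        self.w_k = nn.Linear(dim, dim)                          # linear mapping W_K
        self.w_v = nn.Linear(dim, dim)                          # linear mapping W_V
        self.cls = nn.Linear(dim, num_classes)                  # classification linear layer

    def forward(self, feat):
        # feat: output features of the N-th layer of the discrimination branch, (B, N, dim)
        q = self.query.expand(feat.size(0), -1, -1)
        attn = torch.softmax(q @ self.w_k(feat).transpose(1, 2), dim=-1)
        out = attn @ self.w_v(feat)
        return self.cls(out)                    # per-character logits, (B, max_len, num_classes)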
(2) Generation branch.
The generation branch requires both style features and content features in order to complete generation tasks. Since the content features carry relatively little detail information, and the missing detail information is extracted within the discrimination branch, a gated injection mechanism is designed to inject the fine-grained information extracted in the discrimination branch into the generation branch to help complete the generation task.
Specifically: the generating branch includes: the method comprises the steps of gating an injection mechanism layer, a second N-layer attention network, a background reconstruction head and a text rendering head; wherein the input of the second N-tier attention network comprises: the method comprises the steps that in a text image pair, a style characteristic and a content characteristic of one text image and a text prompt in another image are fused with an output characteristic of a corresponding layer of a first N-layer attention network in a distinguishing branch through a gating injection mechanism layer, and then are spliced with a style characteristic part and a text prompt part in the output characteristic of the second N-layer attention network, the spliced characteristic is used for processing of the next layer, and the output of the last layer serves as a final characteristic; and the background reconstruction head reconstructs a background image by using the final characteristics, and the text rendering head acquires a reconstruction result of the other text image by using the text content corresponding to the text prompt and the final characteristics.
In the embodiment of the present invention, the processing of the gated injection mechanism layer and the related concatenation are expressed as:

$$g_i = \sigma\!\left(\mathrm{MLP}\!\left(\left[h_i^{c};\, f_i\right]\right)\right), \qquad \hat{h}_i^{c} = h_i^{c} \oplus g_i \odot f_i$$

$$H_i = \left[h_i^{p};\, h_i^{s};\, \hat{h}_i^{c}\right], \qquad \left[h_{i+1}^{p};\, h_{i+1}^{s};\, h_{i+1}^{c}\right] = \mathrm{SA}\!\left(H_i\right)$$

where $h_i^{c}$ is the content-feature part of the $i$-th layer output feature of the second N-layer attention network; $f_i$ is the $i$-th layer output feature of the first N-layer attention network in the discrimination branch, $i = 1, \dots, N$; $\hat{h}_i^{c}$ is the feature obtained by fusing $h_i^{c}$ and $f_i$ through the gated injection mechanism layer; $h_i^{p}$ is the text-prompt part and $h_i^{s}$ the style-feature part of the $i$-th layer output feature of the second N-layer attention network; $H_i$ is the concatenated (spliced) feature used for the processing of layer $i+1$; $h_{i+1}^{p}$, $h_{i+1}^{s}$ and $h_{i+1}^{c}$ are the text-prompt, style-feature and content-feature parts of the $(i+1)$-th layer output feature of the second N-layer attention network; $\mathrm{MLP}$ is the multi-layer perceptron in the gated injection mechanism layer; $\sigma$ is the activation function; $g_i$ is the gating factor; $\odot$ denotes matrix point multiplication (element-wise multiplication); $\oplus$ denotes the add operation; and $\mathrm{SA}$ denotes a self-attention layer.
In the embodiment of the present invention, the N-layer attention networks in the two branches have the same structure, each layer consisting of a self-attention layer and a feed-forward network; the specific processing involved follows conventional techniques and is not detailed here. In addition, in the above processing, when $i = N$ the concatenated feature is the final feature, and when $i = 1$ the features used are the relevant information input to the generation branch; this also follows conventional logic and is not repeated.
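One way to read the gated injection and splicing above is sketched below; the exact form of the gate and the layer interface are assumptions consistent with the definitions given.

# Sketch of the gated injection mechanism and splicing in the generation branch:
# fine-grained information f_i from the discrimination branch is injected into
# the content part c_i, gated by g_i, then concatenated with the prompt and
# style parts and processed by the next self-attention layer.
import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        # MLP followed by a sigmoid activation produces the gating factor g_i
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, c_i, f_i):
        # c_i: content part of the i-th layer output of the generation branch, (B, Nc, dim)
        # f_i: i-th layer output of the discrimination branch,                  (B, Nc, dim)
        g = self.gate(torch.cat([c_i, f_i], dim=-1))    # gating factor g_i
        return c_i + g * f_i                            # fused (injected) content feature

def generation_layer_step(attn_layer, inject, p_i, s_i, c_i, f_i):
    fused_c = inject(c_i, f_i)                          # gated injection
    spliced = torch.cat([p_i, s_i, fused_c], dim=1)     # [prompt; style; content]
    return attn_layer(spliced)                          # processed by layer i+1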
4. Decoupling characterization learning.
For decoupling purposes, a task-specific loss function is employed to help decouple the two-part feature.
(1) The recognition loss ensures, during the pre-training stage, that the content features contain only the content information and detail information of the text in the image, excluding the font style and background information that act as recognition noise. The recognition loss is expressed as:

$$\mathcal{L}_{rec} = -\sum_{j=1}^{T} \log P\!\left(\hat{y}_j \mid y_j\right)$$

where $\mathcal{L}_{rec}$ is the recognition loss, $T$ is the preset maximum character length, $\hat{y}_j$ is the text recognition result for the $j$-th character, $y_j$ is the text label of the $j$-th character, and $P(\hat{y}_j \mid y_j)$ denotes the probability that the text recognition result is $\hat{y}_j$ when the text label is $y_j$.
(2) Since the styles of the text contained in each image pair are consistent, the style features of the two images in a pair are aligned using an alignment loss, implemented as an L2 loss:

$$\mathcal{L}_{align} = \left\| F_s^{(1)} - F_s^{(2)} \right\|_2^2$$

where $F_s^{(1)}$ and $F_s^{(2)}$ are the style features of the two text images in the same text image pair, and $\mathcal{L}_{align}$ is the alignment loss.
(3) The generation branch is used to reconstruct the background image and the other text image of the pair, with the style features $F_s$ of one image provided as input during reconstruction. The reconstruction loss consists of two parts, a background reconstruction loss and an edited-image reconstruction loss, which respectively supervise the reconstructed background image and the reconstruction result of the other text image. When the training dataset is produced, the background image is synthesized separately; the background reconstruction loss is the loss between the reconstructed background image and the corresponding background image, and the edited-image reconstruction loss is the loss between the reconstruction result of the other text image and that other text image. Both reconstruction losses can be computed with the L2 loss and are not repeated here.
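For illustration, the three pre-training losses above can be written as the following functions; the padding index, reduction choices and loss weights are assumptions.

# Sketch of the three pre-training losses (padding index, reduction and
# weighting choices are assumptions).
import torch.nn.functional as F

def recognition_loss(logits, labels, pad_idx=0):
    # logits: (B, T, num_classes) per-character predictions, T = max character length
    # labels: (B, T) ground-truth character indices, padded with pad_idx
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=pad_idx)

def alignment_loss(style_a, style_b):
    # L2 loss pulling together the style features of the two images in a pair
    return F.mse_loss(style_a, style_b)

def reconstruction_loss(bg_pred, bg_gt, edit_pred, edit_gt):
    # background reconstruction loss plus edited-image reconstruction loss (both L2)
    return F.mse_loss(bg_pred, bg_gt) + F.mse_loss(edit_pred, edit_gt)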
As described earlier, the pre-training is performed based on the above three losses, and the pre-trained model can be applied directly to the scene text editing task. For the scene text erasing and scene text recognition tasks, to improve task performance, the pre-trained model needs to be fine-tuned using the relevant task datasets: when applied to the scene text erasing task, the pre-trained model is fine-tuned using the relevant task dataset together with the reconstruction loss; when applied to the scene text recognition task, it is fine-tuned using the relevant task dataset together with the recognition loss.
5. Training details.
For the pre-training stage, pre-training is performed on the 4 million synthesized text image pairs. The initial learning rate is 1e-4, training runs for 200,000 steps in total, and the learning rate decays to 1e-5 after 7 training epochs. The weight decay is set to 0.05, the optimizer momentum parameters are set to 0.9 and 0.99, and the batch size is set to 384. In addition, the input image size is 32×128.
The same image resolution and window size as in the pre-training stage are used during fine-tuning. For scene text recognition, fine-tuning runs for 400,000 steps; for scene text erasing, fine-tuning runs for 100,000 steps.
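For reference, the schedule above could be configured roughly as follows; the optimizer type (AdamW) is an assumption, and only the numeric values are taken from the text.

# Illustrative training configuration matching the reported hyperparameters
# (the optimizer type is an assumption; the numbers come from the description).
import torch

def build_optimizer(model):
    return torch.optim.AdamW(
        model.parameters(),
        lr=1e-4,                  # initial learning rate
        betas=(0.9, 0.99),        # optimizer momentum parameters
        weight_decay=0.05,
    )

PRETRAIN_STEPS = 200_000          # pre-training steps, batch size 384, 32x128 inputs
MIN_LR = 1e-5                     # learning rate decays to 1e-5 after 7 epochs
FINETUNE_STEPS = {"recognition": 400_000, "erasing": 100_000}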
It should be noted that the training details provided herein are merely preferred embodiments, and in practical applications, the user may adjust according to actual situations or experience.
3. And (5) effect evaluation.
To verify the effectiveness of the invention, verification and evaluation were performed on the three tasks of scene text recognition, editing and erasing, and the model of the invention achieves state-of-the-art performance on all three. The recognition dataset is Union-Standard, and testing was performed on its subsets Curve, Multi-Oriented, Artistic, Contextless, Salient, Multi-Words and General; the evaluation of test performance on the recognition dataset is detailed in Table 1.
Table 1: text recognition accuracy over multiple data sets
The benchmark model in Table 1 is the scene text task processing model described above, but trained without the pre-training scheme provided by the invention and instead fine-tuned directly on the recognition dataset. From the results shown in Table 1, the recognition accuracy of the invention on the scene text recognition task is significantly better than that of the benchmark model.
In addition, fig. 5 and 6 are examples of effects of the present invention on scene text editing and erasing tasks, and it can be seen that the present invention can achieve better effects in the above two tasks.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
Example two
The invention also provides a scene text task processing system, which is mainly used for realizing the method provided by the previous embodiment, as shown in fig. 7, and mainly comprises:
The model construction unit is used for constructing a scene text task processing model and comprises the following steps: a visual encoder, a decoupling network and a multi-tasking decoder;
The model training unit is used for training the scene text task processing model by using the decoupling training data set, the training process comprises pre-training and fine-tuning, and during pre-training, the decoupling training data set comprising a plurality of text image pairs is obtained, and each text image pair comprises two text images with the same background and font style and different text contents; the visual characteristics of each text image in each text image pair are respectively extracted by using a visual encoder, and are decoupled into content characteristics and style characteristics through a decoupling network, and the alignment loss is calculated by using the style characteristics of the two text images in the same text image pair; respectively inputting the content characteristics of each text image into a multitask decoder to obtain text recognition results, and calculating recognition loss by using the text recognition results; inputting style characteristics and content characteristics of one text image in the text image pair and text prompts in the other image into a multi-task decoder to obtain a reconstruction result of the other image, and calculating reconstruction loss by using the reconstruction result; pre-training the scene text task processing model by combining all the calculated losses;
And the task processing unit is used for executing the scene text task by using the trained scene text task processing model.
In view of the technical details of the units in the system that have been described in detail in the previous embodiment, they will not be described in detail.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 8, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
The memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1.A scene text task processing method, comprising:
Constructing a scene text task processing model, comprising: a visual encoder, a decoupling network and a multi-tasking decoder;
Training the scene text task processing model, wherein the training process comprises pre-training and fine-tuning, and during pre-training, a decoupling training data set containing a plurality of text image pairs is obtained, wherein each text image pair contains two text images with the same background and font style and different text contents; the visual characteristics of each text image in each text image pair are respectively extracted by using a visual encoder, and are decoupled into content characteristics and style characteristics through a decoupling network, and the alignment loss is calculated by using the style characteristics of the two text images in the same text image pair; respectively inputting the content characteristics of each text image into a multitask decoder to obtain text recognition results, and calculating recognition loss by using the text recognition results; inputting style characteristics and content characteristics of one text image in the text image pair and text prompts in the other text image into a multitask decoder to obtain a reconstruction result of reconstructing the background image and the other text image, and calculating corresponding reconstruction loss by using the reconstruction result of reconstructing the background image and the other text image; pre-training the scene text task processing model by combining all the calculated losses;
And executing the scene text task by using the trained scene text task processing model.
2. The method according to claim 1, wherein the step of extracting visual features of each text image in each text image pair using a visual encoder and decoupling the visual features into content features and style features via a decoupling network comprises:
the single text image is denoted $X$, the extracted visual feature is denoted $F$, and the visual feature is decoupled into content features $F_c$ and style features $F_s$; the style features include background and font style, and the content features include text content and texture details.
3. The scene text task processing method according to claim 1, wherein the alignment loss is expressed as:

$$\mathcal{L}_{align} = \left\| F_s^{(1)} - F_s^{(2)} \right\|_2^2$$

where $F_s^{(1)}$ and $F_s^{(2)}$ are the style features of the two text images in the same text image pair, and $\mathcal{L}_{align}$ is the alignment loss.
4. The scene text task processing method according to claim 1, wherein the inputting the content features of each text image into the multitasking decoder to obtain the text recognition result comprises:
the multitasking decoder comprises a judging branch which is used for identifying the scene text;
The discrimination branch includes: a first N-layer attention network and a text recognition head; the content characteristics of each text image are input into the first N-layer attention network, the characteristics are extracted through the first N-layer attention network, and the last layer of output characteristics of the first N-layer attention network obtain a text recognition result through a text recognition head.
5. The scene text task processing method according to claim 1 or 4, wherein the recognition loss is expressed as:

$$\mathcal{L}_{rec} = -\sum_{j=1}^{T} \log P\!\left(\hat{y}_j \mid y_j\right)$$

where $\mathcal{L}_{rec}$ is the recognition loss, $T$ is the preset maximum character length, $\hat{y}_j$ is the text recognition result for the $j$-th character, $y_j$ is the text label of the $j$-th character, and $P(\hat{y}_j \mid y_j)$ denotes the probability that the text recognition result is $\hat{y}_j$ when the text label is $y_j$.
6. The method according to claim 4, wherein inputting the style and content features of one of the text images and the text prompts of the other text image into the multitasking decoder, obtaining the reconstruction result of reconstructing the background image and the other text image comprises:
the multitasking decoder comprises a generating branch which is used for generating a class task through texts;
The generating branch includes: the method comprises the steps of gating an injection mechanism layer, a second N-layer attention network, a background reconstruction head and a text rendering head; wherein the input of the second N-tier attention network comprises: the method comprises the steps that in a text image pair, style characteristics and content characteristics of one text image and text prompts in another image are fused with output characteristics of a corresponding layer of a first N-layer attention network in a distinguishing branch through a gating injection mechanism layer, then the fused content characteristics are spliced with style characteristics and text prompts in the output characteristics of the second N-layer attention network, the spliced characteristics are used for processing of the next layer, and output of the last layer serves as final characteristics; and the background reconstruction head reconstructs a background image by using the final characteristics, and the text rendering head acquires a reconstruction result of the other text image by using text contents corresponding to the text prompts and the final characteristics.
7. The method of claim 6, wherein the processing of the gated injection mechanism layer and the related concatenation are expressed as:

$$g_i = \sigma\!\left(\mathrm{MLP}\!\left(\left[h_i^{c};\, f_i\right]\right)\right), \qquad \hat{h}_i^{c} = h_i^{c} \oplus g_i \odot f_i$$

$$H_i = \left[h_i^{p};\, h_i^{s};\, \hat{h}_i^{c}\right], \qquad \left[h_{i+1}^{p};\, h_{i+1}^{s};\, h_{i+1}^{c}\right] = \mathrm{SA}\!\left(H_i\right)$$

where $h_i^{c}$ is the content-feature part of the $i$-th layer output feature of the second N-layer attention network; $f_i$ is the $i$-th layer output feature of the first N-layer attention network in the discrimination branch, $i = 1, \dots, N$; $\hat{h}_i^{c}$ is the feature obtained by fusing $h_i^{c}$ and $f_i$ through the gated injection mechanism layer; $h_i^{p}$ is the text-prompt part and $h_i^{s}$ the style-feature part of the $i$-th layer output feature of the second N-layer attention network; $H_i$ is the concatenated (spliced) feature used for the processing of layer $i+1$, and when $i = N$ the concatenated feature is the final feature; $h_{i+1}^{p}$, $h_{i+1}^{s}$ and $h_{i+1}^{c}$ are the text-prompt, style-feature and content-feature parts of the $(i+1)$-th layer output feature of the second N-layer attention network; $\mathrm{MLP}$ is the multi-layer perceptron in the gated injection mechanism layer; $\sigma$ is the activation function; $g_i$ is the gating factor; $\odot$ denotes matrix point multiplication (element-wise multiplication); $\oplus$ denotes the add operation; and $\mathrm{SA}$ denotes a self-attention layer.
8. A scene text task processing system, comprising:
The model construction unit is used for constructing a scene text task processing model and comprises the following steps: a visual encoder, a decoupling network and a multi-tasking decoder;
The model training unit is used for training the scene text task processing model, the training process comprises pre-training and fine-tuning, and during pre-training, a decoupling training data set containing a plurality of text image pairs is obtained, and two text images contained in each text image pair have the same background and font style and different text contents; the visual characteristics of each text image in each text image pair are respectively extracted by using a visual encoder, and are decoupled into content characteristics and style characteristics through a decoupling network, and the alignment loss is calculated by using the style characteristics of the two text images in the same text image pair; respectively inputting the content characteristics of each text image into a multitask decoder to obtain text recognition results, and calculating recognition loss by using the text recognition results; inputting style characteristics and content characteristics of one text image in the text image pair and text prompts in the other image into a multi-task decoder to obtain a reconstruction result of the other image, and calculating reconstruction loss by using the reconstruction result; pre-training the scene text task processing model by combining all the calculated losses;
And the task processing unit is used for executing the scene text task by using the trained scene text task processing model.
9. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
Wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-7.
CN202410326155.8A 2024-03-21 2024-03-21 Scene text task processing method, system, equipment and storage medium Active CN117934974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410326155.8A CN117934974B (en) 2024-03-21 2024-03-21 Scene text task processing method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410326155.8A CN117934974B (en) 2024-03-21 2024-03-21 Scene text task processing method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117934974A true CN117934974A (en) 2024-04-26
CN117934974B CN117934974B (en) 2024-06-14

Family

ID=90766793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410326155.8A Active CN117934974B (en) 2024-03-21 2024-03-21 Scene text task processing method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117934974B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10185713B1 (en) * 2015-09-28 2019-01-22 Amazon Technologies, Inc. Optimized statistical machine translation system with rapid adaptation capability
CN110929665A (en) * 2019-11-29 2020-03-27 河海大学 Natural scene curve text detection method
CN111967470A (en) * 2020-08-20 2020-11-20 华南理工大学 Text recognition method and system based on decoupling attention mechanism
WO2023050295A1 (en) * 2021-09-30 2023-04-06 中远海运科技股份有限公司 Multimodal heterogeneous feature fusion-based compact video event description method
CN114170099A (en) * 2021-12-02 2022-03-11 中国科学技术大学 Method, system, equipment and storage medium for erasing characters in scenes with arbitrary shapes
CN113934887A (en) * 2021-12-20 2022-01-14 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling
WO2023125379A1 (en) * 2021-12-29 2023-07-06 北京字跳网络技术有限公司 Character generation method and apparatus, electronic device, and storage medium
CN115311117A (en) * 2022-09-13 2022-11-08 北京航空航天大学 Image watermarking system and method for style migration depth editing
CN116682120A (en) * 2023-05-08 2023-09-01 华中科技大学 Multilingual mosaic image text recognition method based on deep learning
CN117058667A (en) * 2023-09-07 2023-11-14 华中科技大学 End-to-end scene text recognition method based on CLIP
CN117635771A (en) * 2023-12-11 2024-03-01 浙江工业大学 Scene text editing method and device based on semi-supervised contrast learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HUANG M ET AL: "ESTextSpotter: Towards better scene text spotting with explicit synergy in transformer", arXiv, 20 August 2023 (2023-08-20), pages 1-12 *
WANG P ET AL: "Multi-granularity prediction for scene text recognition", European Conference on Computer Vision, 20 October 2022 (2022-10-20), pages 339-355 *
ZHANG B ET AL: "Linguistic More: Taking a further step toward efficient and accurate scene text recognition", arXiv, 10 May 2023 (2023-05-10), pages 1-9 *
付鸿林 et al.: "Uyghur scene text modification network based on generative adversarial network" [基于生成对抗网络的维语场景文字修改网络], Computer and Modernization, no. 01, 15 January 2024 (2024-01-15), pages 41-46 *
王紫霄 et al.: "Scene text detection with hierarchical semantic fusion" [层级语义融合的场景文本检测], Journal of Image and Graphics, vol. 28, no. 08, 16 August 2023 (2023-08-16), pages 2343-2355 *

Also Published As

Publication number Publication date
CN117934974B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
Frolov et al. Adversarial text-to-image synthesis: A review
Ma et al. Learning to generate grounded visual captions without localization supervision
CN110263150A (en) Document creation method, device, computer equipment and storage medium
CN113963409A (en) Training of face attribute editing model and face attribute editing method
CN110148400A (en) The pronunciation recognition methods of type, the training method of model, device and equipment
CN110798636A (en) Subtitle generating method and device and electronic equipment
CN109389427A (en) Questionnaire method for pushing, device, computer equipment and storage medium
CN113837229B (en) Knowledge-driven text-to-image generation method
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN116091630A (en) Method and device for training image generation model
Fang et al. Image captioning with word level attention
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN117522697A (en) Face image generation method, face image generation system and model training method
Zakraoui et al. Improving text-to-image generation with object layout guidance
CN115423936A (en) AI virtual character and image processing method, system, electronic device and storage medium
Matsumori et al. Lattegan: Visually guided language attention for multi-turn text-conditioned image manipulation
Zhuang et al. Tip-editor: An accurate 3d editor following both text-prompts and image-prompts
Zhou et al. A survey on data augmentation in large model era
CN117934974B (en) Scene text task processing method, system, equipment and storage medium
CN117252947A (en) Image processing method, image processing apparatus, computer, storage medium, and program product
CA3166556A1 (en) Method and device for generating target advertorial based on deep learning
Wu et al. Showface: Coordinated face inpainting with memory-disentangled refinement networks
CN112861580A (en) Video information processing method and device based on video information processing model
CN113963092B (en) Audio and video fitting associated computing method, device, medium and equipment
CN116306517A (en) Text processing model training and title generation methods, devices, equipment and media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant