CN116863017A - Image processing method, network model training method, device, equipment and medium


Info

Publication number
CN116863017A
Authority
CN
China
Prior art keywords
image
network
erasing
erasure
document image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310659387.0A
Other languages
Chinese (zh)
Inventor
刘腾龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310659387.0A
Publication of CN116863017A
Legal status: Pending

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/0475: Generative networks
    • G06N 3/09: Supervised learning
    • G06N 3/094: Adversarial learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 30/19147: Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 30/1918: Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G06V 30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G06V 40/33: Writer recognition; Reading and verifying signatures based only on signature image, e.g. static signature recognition


Abstract

The disclosure provides an image processing method, a network model training method, a device, equipment and a medium, relating to the field of image processing and in particular to the technical fields of computer vision, deep learning, image reconstruction and the like. The specific implementation scheme is as follows: a document image to be processed is input into a pre-trained erasing network, which erases the first format text in the document image to be processed to obtain a final erased image corresponding to the document image to be processed. The erasing network, together with a segmentation head network and a discriminator network, forms an erasing network model, and the erasing network is obtained by training the erasing network model in advance with sample document image pairs; each sample document image pair comprises a sample document image and a corresponding label image in which the first format text has been erased. The erasing network model is trained according to at least a content loss function and a style loss function, and these two losses are obtained from mixed image features of the sample document image and the label image.

Description

Image processing method, network model training method, device, equipment and medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to the technical fields of computer vision, deep learning, image reconstruction, and the like. In particular, the disclosure relates to an image processing method, a network model training method, a device, equipment and a medium.
Background
Text erasure, especially erasure of text in a specific format (e.g., handwritten text), is widely used in education, office work, privacy protection, image editing and other scenarios.
For example, erasing the handwriting in an office document bearing handwriting restores a blank document with one click; erasing the handwriting in an image of an answered test paper restores a blank test paper, so that students can answer it again and educational institutions can record and organize question banks, convert test paper formats, and so on.
Disclosure of Invention
The disclosure provides an image processing method, a network model training method, a device, equipment and a medium.
According to a first aspect of the present disclosure, there is provided a method of image processing, the method comprising:
acquiring a document image to be processed;
erasing the first format text in the document image to be processed by inputting the document image to be processed into a pre-trained erasing network, to obtain a final erased image corresponding to the document image to be processed;
wherein the erasing network, a segmentation head network and a discriminator network form an erasing network model, and the erasing network is obtained by training the erasing network model in advance with sample document image pairs; each sample document image pair comprises a sample document image and a corresponding label image in which the first format text has been erased;
the erasing network model is trained according to at least a content loss function and a style loss function; the content loss function and the style loss function are obtained from mixed image features of the sample document image and the label image; and the mixed image features are obtained from the pixels corresponding to the first format text in the sample document image and the other pixels, apart from those corresponding to the first format text, in the image output by the erasing network.
According to a second aspect of the present disclosure, there is provided a training method of erasing a network, the method comprising:
acquiring a sample document image pair, wherein the sample document image pair comprises a sample document image and a label image corresponding to the sample document image, in which the characters of a first format are erased;
training an erasure network model according to the sample document image pair to obtain an erasure network;
The erasure network, the segmentation head network and the discriminator network form the erasure network model; the erasing network is used for erasing the first format characters in the sample document image input into the erasing network, and obtaining a final erasing image corresponding to the sample document image; the segmentation head network is used for acquiring a mask corresponding to the first format text in the sample document image; the discriminator network is used for judging whether the first format characters in the final erased image are erased or not;
the erasure network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
According to a third aspect of the present disclosure, there is provided an apparatus for image processing, the apparatus comprising:
the image module is used for acquiring a document image to be processed;
the reasoning module is used for erasing the first format text in the document image to be processed by inputting the document image to be processed into a pre-trained erasing network, so as to obtain a final erased image corresponding to the document image to be processed;
wherein the erasing network, a segmentation head network and a discriminator network form an erasing network model, and the erasing network is obtained by training the erasing network model in advance with sample document image pairs; each sample document image pair comprises a sample document image and a corresponding label image in which the first format text has been erased;
the erasing network model is trained according to at least a content loss function and a style loss function; the content loss function and the style loss function are obtained from mixed image features of the sample document image and the label image; and the mixed image features are obtained from the pixels corresponding to the first format text in the sample document image and the other pixels, apart from those corresponding to the first format text, in the image output by the erasing network.
According to a fourth aspect of the present disclosure, there is provided a training apparatus for erasing a network, the apparatus comprising:
the sample module is used for acquiring a sample document image pair, wherein the sample document image pair comprises a sample document image and a label image corresponding to the sample document image, and the label image is erased with characters in a first format;
The training module is used for training the erasing network model according to the sample document image pair to acquire an erasing network;
the erasure network, the segmentation head network and the discriminator network form the erasure network model; the erasing network is used for erasing the first format characters in the sample document image input into the erasing network, and obtaining a final erasing image corresponding to the sample document image; the segmentation head network is used for acquiring a mask corresponding to the first format text in the sample document image; the discriminator network is used for judging whether the first format characters in the final erased image are erased or not;
the erasure network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of image processing and/or the training method of the erasure network.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above-described method of image processing and/or a training method of an erasure network.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described method of image processing and/or a training method of an erasure network.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of image processing provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of some steps of another method of image processing provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of some steps of another method of image processing provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a partial network structure of an erasure network model of another method of image processing provided by an embodiment of the present disclosure;
FIG. 5 is a flow chart of some steps of another method of image processing provided by an embodiment of the present disclosure;
FIG. 6 is a flow chart of some steps of another method of image processing provided by an embodiment of the present disclosure;
FIG. 7 is a flow chart of some steps of another method of image processing provided by an embodiment of the present disclosure;
fig. 8 is a flowchart of a training method for erasing a network according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an apparatus for image processing according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a training device for erasing a network according to an embodiment of the present disclosure;
fig. 11 is a block diagram of an electronic device for implementing the method of image processing and the training method of erasing a network of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In some related technologies, the text region of specific-format text in an image to be recognized may be located by OCR (Optical Character Recognition) technology, and the specific-format text in the image is then erased using a related image restoration method.
However, because OCR performs region-level detection rather than character-by-character detection, it cannot handle overlapping text of different formats when text in other formats is present in the image. Moreover, when the background of the image to be recognized contains shadows and needs to be whitened, OCR cannot remove the background shadows itself; background shadow removal has to be treated as an upstream task of OCR.
In some related technologies, the specific-format text erasure task may be treated as a semantic segmentation task: a semantic segmentation model classifies the pixels of the image to be recognized into a background class, a specific-format text class and other-format text classes, the pixels corresponding to the specific-format text are identified by this pixel-by-pixel classification, and erasure is realized by replacing those pixels with the background class.
However, with pixel-by-pixel classification the erasing effect on the specific-format text is poor: its pixels are easily misclassified, so the text is not erased cleanly.
In some related arts, the specific-format text erasure task may be treated as an image generation task: an image containing specific-format text is taken as the model input, a model based on a generative adversarial network (GAN) learns to erase the specific-format text automatically, and an image without the specific-format text is output.
However, when the background of the image to be recognized contains shadows and needs to be whitened, a GAN-based model fails to converge during training, and background shadow removal can only be realized as an upstream task of the specific-format text erasure task.
The embodiments of the present disclosure provide an image processing method, an erasing network training method, an image processing apparatus, an erasing network training apparatus, an electronic device, and a computer readable storage medium, which aim to solve at least one of the above technical problems in the prior art.
The image processing method and the erasure network training method provided by the embodiments of the present disclosure may be performed by an electronic device such as a terminal device or a server. The terminal device may be a vehicle-mounted device, a user device (User Equipment, UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a wearable device, or the like, and the method may be implemented by a processor invoking computer-readable program instructions stored in a memory. Alternatively, the method may be performed by a server.
Fig. 1 shows a flowchart of a method of image processing provided by an embodiment of the present disclosure, and as shown in fig. 1, the method of image processing provided by the embodiment of the present disclosure may include step S110 and step S120.
In step S110, a document image to be processed is acquired;
In step S120, the first format text in the document image to be processed is erased by inputting the document image to be processed into a pre-trained erasing network, so as to obtain a final erased image corresponding to the document image to be processed;
wherein the erasing network, a segmentation head network and a discriminator network form an erasing network model, and the erasing network is obtained by training the erasing network model in advance with sample document image pairs; each sample document image pair comprises a sample document image and a corresponding label image in which the first format text has been erased;
the erasing network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
In some possible implementations, in step S110, the document image to be processed may be an image of a document to be processed having text in a first format; the document image to be processed may also be a document image to be processed having at least the first format text and the second format text.
That is, the document image to be processed may be an image of a document having only one format text; or an image of a document having text in a plurality of formats.
In some specific implementations, the first format text can be handwriting text and the second format text can be printed style text. The document image to be processed may be an image of a test paper with handwriting (e.g., a test paper that has been answered) or an image of an office document with handwriting (e.g., a document with a handwritten signature).
In some possible implementations, in step S120, the erasure network may include a feature extraction sub-network, a coarse erasure sub-network, a fine erasure sub-network.
The feature extraction sub-network is used for extracting features of the document image to be processed and obtaining image features of the document image to be processed.
The pre-trained feature extraction sub-network may be any network with image feature extraction capabilities, such as ResNet (residual network), VGGNet (visual geometry group network), and the like. The deeper the hierarchy of the feature extraction subnetwork, the lower the resolution of the output image features.
In the process of extracting image features, the deeper the hierarchy of the feature extraction sub-network, the lower the resolution of the extracted image features. Although low-resolution image features can be restored to high-resolution image features by upsampling or deconvolution, the process of changing resolution from high to low still loses information, such as spatial information.
In some possible implementations, the spatial information loss in the feature extraction sub-network, caused by feature resolution decreasing as the network deepens, can be avoided by connecting image features of different resolutions in parallel.
In some possible implementations, image features of different resolutions can also be interactively fused, likewise avoiding this spatial information loss.
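The interactive fusion described above can be illustrated with a minimal PyTorch sketch. The patent does not disclose concrete layers, so the module names and parameter choices below (1x1 and strided 3x3 convolutions, bilinear upsampling) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleFusion(nn.Module):
    """Illustrative fusion of a high-resolution and a low-resolution feature
    map so that each branch also receives information from the other."""
    def __init__(self, c_high: int, c_low: int):
        super().__init__()
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1)   # match channels
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3,
                                     stride=2, padding=1)            # match resolution

    def forward(self, x_high: torch.Tensor, x_low: torch.Tensor):
        # Low-resolution features are upsampled and added to the high-resolution branch.
        up = F.interpolate(self.low_to_high(x_low), size=x_high.shape[-2:],
                           mode="bilinear", align_corners=False)
        # High-resolution features are strided down and added to the low-resolution branch.
        down = self.high_to_low(x_high)
        return x_high + up, x_low + down

# Example: fuse a 64-channel 1/4-scale map with a 128-channel 1/8-scale map.
fusion = TwoScaleFusion(c_high=64, c_low=128)
hi, lo = fusion(torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32))
```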
The rough erasing sub-network is used for erasing the first format characters in the document image to be processed according to the image characteristics of the document image to be processed, and obtaining a rough erasing image corresponding to the document image to be processed.
The rough erasing sub-network restores the features output by the feature extraction sub-network to a high-resolution representation through upsampling or deconvolution, generates a feature map with the same size as the document image to be processed, and thereby produces a rough erased image in which the first format text of the document image to be processed is erased.
That is, the rough erasing sub-network is configured to rough erase text in a specific format (e.g., a first format) in an image corresponding to the image feature according to the image feature output by the feature extraction sub-network, and generate an erased image.
In some possible implementations, the coarse erasure sub-network can be a decoder corresponding to the feature extraction sub-network (which acts as the encoder).
In some possible implementations, the performance of this decoder can be improved by connecting image features of different resolutions in parallel, improving the erasing effect of the obtained rough erased image.
In some possible implementations, the performance of this decoder can also be improved by interactively fusing image features of different resolutions, improving the erasing effect of the obtained rough erased image.
The fine erasing sub-network is used for erasing the first format characters in the rough erasing image according to the rough erasing image corresponding to the document image to be processed, and obtaining a final erasing image corresponding to the document image to be processed.
In some possible implementations, the fine erasure sub-network may include an encoder and a decoder.
The encoder extracts features from the rough erased image input into it, and the decoder upsamples or deconvolves the features extracted by the encoder to generate a final erased image in which the first format text in the rough erased image is erased.
That is, on the basis of the rough erasure sub-network's rough erasure of the first format text in the document image to be processed, the fine erasure sub-network performs fine erasure to generate a final erased image with a better erasing effect, as sketched below.
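The coarse-to-fine flow described above can be illustrated with the following sketch. The layer counts, channel widths and activations are placeholder assumptions, not the patent's architecture; only the features -> coarse erased image -> final erased image data flow is the point.

```python
import torch
import torch.nn as nn

class CoarseEraser(nn.Module):
    """Decodes backbone features into a full-resolution coarse erased image."""
    def __init__(self, in_ch: int = 256):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.up(feats)

class FineEraser(nn.Module):
    """Encoder-decoder that refines the coarse result into the final image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(           # extracts features from the coarse image
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(           # upsamples back to image resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, coarse: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(coarse))

feats = torch.randn(1, 256, 64, 64)    # backbone features at 1/4 scale (assumed)
i_cout = CoarseEraser()(feats)         # coarse erased image, 256x256
i_rout = FineEraser()(i_cout)          # final erased image, 256x256
```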
In some possible implementation manners, the performance of the fine erasure sub-network can be improved by parallel connection between image features with different resolutions, and the erasure effect of the obtained final erasure image is improved.
In some possible implementation manners, the performance of the fine erasure sub-network can be improved by performing interactive fusion between image features with different resolutions, and the erasure effect of the obtained final erasure image is improved.
The final erasing image output by the fine erasing sub-network is an image which erases the first format characters of the document image to be processed and retains other format characters (such as the second format characters).
In the case that the first format text is handwritten text and the second format text is print-format text (i.e., printed text), the final erased image is an image that retains the printed text and erases the handwriting.
When the document image to be processed is an image of a test paper with handwriting (e.g., an answered test paper), the image processing method of the embodiments of the present disclosure can restore the test paper to a blank test paper (i.e., an unanswered test paper) with one click. In an education toC (user-oriented) scenario, one-click test paper restoration lets users practice the questions they got wrong again; in an education toB (enterprise-oriented) scenario, one-click test paper restoration keeps handwriting from interfering with OCR detection and recognition tasks, improving the accuracy of question bank recording and organization and of test paper format conversion.
In the case where the document image to be processed is an image of an office document having handwriting (e.g., a document having a handwritten signature), the signature or the like of the office document may be erased by the image processing method of the embodiment of the present disclosure, and the original office document may be acquired.
The erasure network used in the image processing method of the embodiments of the present disclosure is a component of the erasure network model.
That is, the feature extraction sub-network, the coarse erasure sub-network and the fine erasure sub-network of the erasure network used in the image processing method of the embodiments of the present disclosure, together with the segmentation head network and the discriminator network, constitute the erasure network model (EraseNet).
Specifically, two branches are connected after the feature extraction sub-network. One branch is a supervision branch, namely the segmentation head, whose main role is to determine the position of the mask of the first format text and to use the mask to constrain the training of the erasure branch. The other branch is the erasure branch, which includes the coarse erasure sub-network, the fine erasure sub-network and the discriminator network connected in order. The discriminator network is used for judging whether the first format text in the final erased image generated by the fine erasure sub-network has been erased.
The erasure network model (EraseNet) is trained using sample document image pairs, i.e., a sample document image I_Input, a label image I_gt corresponding to I_Input in which the first format text has been erased, and a mask I_Mask of the first format text in I_Input, to obtain a trained erasing network, that is, a trained feature extraction sub-network, coarse erasure sub-network and fine erasure sub-network.
The training process of EraseNet may be as follows: the sample document image is input into the feature extraction sub-network to obtain the image features corresponding to the sample document image; the obtained image features are input into the segmentation head, which outputs the mask of the first format text in the sample image; the obtained image features are also input into the coarse erasure sub-network to obtain the coarse erased image I_Cout it outputs; I_Cout is input into the fine erasure sub-network to obtain the final erased image I_Rout it outputs; and I_Rout is input into the discriminator network to obtain the discriminator network's output.
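An illustrative single training step following this description is sketched below. Here `backbone`, `coarse_eraser`, `fine_eraser`, `seg_head` and `disc` are assumed callables, and the loss terms shown are placeholders; the content and style losses defined next would be added to `loss` in the same way.

```python
import torch
import torch.nn.functional as F

def train_step(backbone, coarse_eraser, fine_eraser, seg_head, disc,
               i_input, i_gt, i_mask, optimizer):
    feats = backbone(i_input)            # image features of the sample image
    mask_pred = seg_head(feats)          # supervision branch: predicted mask
                                         # (assumes seg_head ends with a sigmoid)
    i_cout = coarse_eraser(feats)        # coarse erased image I_Cout
    i_rout = fine_eraser(i_cout)         # final erased image I_Rout

    # Mixed image I_Com: label pixels at first format text positions,
    # network output everywhere else; it feeds the content/style losses below.
    i_com = i_mask * i_gt + (1 - i_mask) * i_rout

    d_out = disc(i_rout)                 # discriminator judgment on I_Rout
                                         # (assumes disc ends with a sigmoid)
    loss = (F.l1_loss(i_rout, i_gt)                                    # reconstruction
            + F.binary_cross_entropy(mask_pred, i_mask)                # supervision branch
            + F.binary_cross_entropy(d_out, torch.ones_like(d_out)))  # GAN term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```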
The loss function of the erasure network model comprises several components. In some possible implementations, it includes at least a style loss function L_S and a content loss function L_Perc.
The style loss function L_S and the content loss function L_Perc can be obtained by the following formulas:

L_Perc = sum_{n=1}^{N} ( ||phi_n(I_Rout) - phi_n(I_gt)||_1 + ||phi_n(I_Com) - phi_n(I_gt)||_1 ) / (H_n * W_n * C_n)

L_S = sum_{n=1}^{N} sum_{I_i} ||G_n(I_i) - G_n(I_gt)||_1 / (C_n * C_n), with the Gram matrix G_n(I) = phi_n(I)^T phi_n(I) / (H_n * W_n * C_n)

wherein phi_n(I_Rout) is the feature map output by the n-th layer of a pre-trained image feature extraction network (such as a VGG-16 network) when I_Rout is input into it; phi_n(I_gt) and phi_n(I_Com) are likewise the n-th layer feature maps obtained by inputting I_gt and I_Com into the pre-trained image feature extraction network; H_n, W_n and C_n respectively denote the height, width and number of channels of the n-th layer of the image feature extraction network; N is the total number of layers of the image feature extraction network; and I_i ranges over I_Rout and I_Com.
I_Com denotes the mixed image features, determined from the pixels corresponding to the first format text in the sample document image and the other pixels, apart from those corresponding to the first format text, in the image output by the erasing network.
Specifically, I_Com can be determined according to the following formula:
I_Com = I_Mask * I_gt + (1 - I_Mask) * I_Rout
Since I_Mask is 1 only at the positions of the first format text and 0 elsewhere, (1 - I_Mask) is 0 only at the positions of the first format text and 1 elsewhere. Thus I_Mask * I_gt yields the pixels at the first format text positions (taken from the label image), and (1 - I_Mask) * I_Rout yields the pixels of the output image other than those corresponding to the first format text. Computing I_Com according to the above formula, and computing the content loss function and the style loss function through I_Com, allows the regions other than the first format text to be supervised by these two losses, decouples the supervision branch from the erasure branch, and improves the erasing effect of the obtained erasing network.
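A hedged PyTorch sketch of these loss computations is given below. The VGG-16 layer indices, the L1 distance and the Gram-matrix normalization follow common practice and are assumptions; the patent only specifies that the losses are computed from feature maps of a pre-trained network over I_Rout, I_Com and I_gt.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

_vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)          # frozen feature extractor
_LAYERS = (3, 8, 15)                 # relu1_2, relu2_2, relu3_3 (an assumption)

def vgg_feats(x: torch.Tensor):
    """Feature maps phi_n(x) at the selected layers."""
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in _LAYERS:
            feats.append(h)
    return feats

def gram(f: torch.Tensor) -> torch.Tensor:
    """Gram matrix normalized by H_n * W_n * C_n."""
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def content_and_style_loss(i_rout, i_gt, i_mask):
    i_com = i_mask * i_gt + (1 - i_mask) * i_rout   # mixed image features
    f_gt = vgg_feats(i_gt)
    l_perc = l_style = 0.0
    for i_x in (i_rout, i_com):                     # I_i ranges over I_Rout, I_Com
        for fx, fg in zip(vgg_feats(i_x), f_gt):
            l_perc = l_perc + F.l1_loss(fx, fg)                 # content term
            l_style = l_style + F.l1_loss(gram(fx), gram(fg))   # style term
    return l_perc, l_style
```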
In some possible implementations, the loss function of the erasure network model further includes a local-aware reconstruction loss L_LR, a segmentation loss (e.g., a Dice loss) for the supervision branch, and a GAN loss for the erasure network and the discriminator network acting as generator and discriminator respectively.
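Assuming the segmentation loss of the supervision branch takes the standard soft Dice form (the patent does not spell it out here), it can be sketched as:

```python
import torch

def dice_loss(mask_pred: torch.Tensor, i_mask: torch.Tensor,
              eps: float = 1e-6) -> torch.Tensor:
    """Standard soft Dice loss; mask_pred and i_mask are (B, 1, H, W) in [0, 1]."""
    inter = (mask_pred * i_mask).sum(dim=(1, 2, 3))
    union = mask_pred.sum(dim=(1, 2, 3)) + i_mask.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()
```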
In some possible implementations, when the label image corresponding to the sample document image used to train EraseNet is an image in which the first format text has been erased from the sample document image and a shadow removal operation has been performed on the background apart from the text (i.e., the background other than the text has been whitened), the trained erasing network can not only erase text from the document image to be processed but also remove shadows from it (i.e., whiten the background other than the text), generating a final erased image in which the first format text is erased and the background is whitened.
In the image processing method provided by the embodiments of the present disclosure, the first format text in the document image to be processed is erased by an end-to-end erasing network. Meanwhile, the mixed image features are determined using the pixels corresponding to the first format text and the other pixels apart from them, which decouples the segmentation head network from the erasing network, ensures that each plays its own role, and improves the accuracy of the final erased image output by the trained erasing network.
The method for image processing provided by the embodiment of the present disclosure is specifically described below.
As described above, in some possible implementations, when the label image corresponding to the sample document image used to train EraseNet is an image in which the first format text has been erased from the sample document image and a shadow removal operation has been performed on the background apart from the text (text of all formats), the trained erasing network can not only erase text from the document image to be processed but also remove shadows from it, generating a final erased image in which the first format text is erased and the background is whitened.
That is, the image processing method provided by the embodiments of the present disclosure can use an end-to-end network model to erase the first format text in the document image to be processed while removing shadows from the background of the document image to be processed.
FIG. 2 is a flow chart of one embodiment of obtaining a final erased image via the erasing network in the case where the label image is an image of the sample document image from which the first format text has been erased and whose background, apart from the text, has undergone a shadow removal operation. As shown in FIG. 2, acquiring the final erased image through the erasing network may include step S210.
In step S210, the first format text in the document image to be processed is erased by inputting the document image to be processed into a pre-trained erasing network, and a shadow removal operation is performed on the background apart from the text in the document image to be processed, so as to obtain a final erased image corresponding to the document image to be processed.
In some possible implementations, in step S210, the training data (i.e., the sample document image pair) corresponding to the erasure network is changed, but the training method of the erasure network is not changed.
The erasing network, the segmentation head network and the discriminator network together form the erasure network model. The specific composition and training process of the erasure network model are described above and are not repeated here.
That is, I_Com is likewise determined from the pixels corresponding to the first format text in the sample document image and the other pixels, apart from those corresponding to the first format text, in the image output by the erasing network. In this way, computing the content loss function and the style loss function through I_Com allows the regions other than the first format text to be supervised by these two losses, prevents those regions from interfering with the background during training, decouples the supervision branch from the erasure branch, and improves both the erasing effect and the shadow removal effect of the obtained erasing network.
Meanwhile, when the label image corresponding to the sample document image is an image in which the first format text has been erased and a shadow removal operation has been performed on the background apart from the text (text of all formats), the erasure model provided by the embodiments of the present disclosure can realize erasing of the first format text plus background shadow removal for the document image to be processed through a single end-to-end network.
As described above, the erasure model provided by the embodiments of the present disclosure includes a feature extraction sub-network, a coarse erasure sub-network, and a fine erasure sub-network.
FIG. 3 shows a schematic flow of acquiring the final erased image corresponding to the document image to be processed through the erasing network in the case where the erasing network includes a feature extraction sub-network, a coarse erasure sub-network and a fine erasure sub-network. As shown in FIG. 3, this may include step S310, step S320 and step S330.
In step S310, the feature extraction is performed on the document image to be processed by inputting the document image to be processed into a feature extraction sub-network trained in advance, so as to obtain the image features of the document image to be processed;
in step S320, erasing the first format text in the document image to be processed according to the image features by inputting the image features into a pre-trained rough erasing sub-network, so as to obtain a rough erasing image corresponding to the document image to be processed;
in step S330, the first format text in the rough erasure image is erased by inputting the rough erasure image into the pre-trained fine erasure sub-network, so as to obtain a final erasure image corresponding to the document image to be processed.
In some possible implementations, in step S310, the feature extraction sub-network is configured to perform feature extraction on the document image to be processed, and obtain image features of the document image to be processed.
The feature extraction sub-network may be any network having image feature extraction capabilities, such as ResNet, VGGNet.
The deeper the hierarchy of the feature extraction sub-network, the lower the resolution of the output image features. Although low-resolution image features can be restored to high-resolution image features by upsampling or deconvolution, the process of changing resolution from high to low still loses information, such as spatial information.
In some possible implementations, in step S320, the rough erasing sub-network restores the features output by the feature extraction sub-network to a high-resolution representation through upsampling or deconvolution, generates a feature map with the same size as the document image to be processed, and produces a rough erased image in which the first format text in the document image to be processed is erased and the background apart from the text is de-shadowed.
That is, the rough erasing sub-network roughly erases text in a specific format (such as the first format) in the image corresponding to the image features output by the feature extraction sub-network, and performs a shadow removal operation on the background apart from the text of all formats.
In some possible implementations, in step S330, the fine erasure sub-network may include an encoder and a decoder.
The encoder extracts features from the rough erased image input into it, and the decoder upsamples or deconvolves the features extracted by the encoder to generate a final erased image in which the first format text in the rough erased image is erased and the background shadows are removed.
That is, on the basis of the rough erasure sub-network's rough erasure of the first format text and rough de-shadowing of the background of the document image to be processed, the fine erasure sub-network performs fine erasure, generating a final erased image with a better erasing effect and a better shadow removal effect.
As described above, in some possible implementations, image features of different resolutions may be interactively fused to avoid the spatial information loss in the feature extraction sub-network caused by feature resolution decreasing as the network deepens.
Fig. 4 shows a schematic diagram of a network for interactive fusion between image features of different resolutions on the basis of a feature extraction sub-network.
As shown in the left half of FIG. 4, during downsampling of the network, the output of each network layer is obtained not only by downsampling the output of its corresponding upper network layer but also from the upsampled result of its corresponding lower network layer.
Fig. 5 shows a flow diagram of one embodiment of acquiring image features using the network shown in fig. 4, which may include step S510, step S520, and step S530, as shown in fig. 5.
In step S510, inputting the document image to be processed into a feature extraction sub-network trained in advance, and extracting image features of the document image to be processed to obtain scale image features of different scales corresponding to the document image to be processed;
in step S520, up-sampling is performed on the scale image features of the next scale corresponding to the scale image features to obtain up-sampled image features;
in step S530, image features of the document image to be processed are acquired from the scale image features and the up-sampled image features.
In some possible implementations, in step S510, feature extraction is performed on the document image to be processed through the feature extraction sub-network trained in advance, and scale image features of different scales output by different network layers of the feature extraction sub-network are acquired.
The deeper a network layer of the feature extraction sub-network, the lower the resolution of the image features it outputs, and the smaller their scale.
In some possible implementations, in step S520, for each scale image feature, the scale image feature of the next scale, i.e., the scale image feature output by the next network layer, is upsampled to obtain an upsampled image feature with the same resolution as the current scale image feature.
In some possible implementations, in step S530, the image features input to the coarse erasure sub-network may be obtained by summing the scale image features and the upsampled image features.
Because the upsampled image features are obtained from the scale image features output by the next network layer (i.e., image features of a different resolution), summing the scale image features and the upsampled image features realizes interactive fusion between image features of different resolutions, keeping the image features as lossless as possible.
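A small sketch of this upsample-and-add fusion (steps S510 to S530) follows; the fine-to-coarse list ordering, the bilinear interpolation and the equal channel counts are assumptions made so the example is self-contained.

```python
import torch
import torch.nn.functional as F

def fuse_with_next_scale(scale_feats):
    """scale_feats: feature maps ordered fine -> coarse, e.g.
    [(B,C,64,64), (B,C,32,32), (B,C,16,16)], with equal channel counts
    so that the element-wise sum is well defined."""
    fused = []
    for i, f in enumerate(scale_feats):
        if i + 1 < len(scale_feats):
            # Upsample the next (coarser) scale to this scale's resolution.
            up = F.interpolate(scale_feats[i + 1], size=f.shape[-2:],
                               mode="bilinear", align_corners=False)
            fused.append(f + up)   # scale feature + upsampled next scale
        else:
            fused.append(f)        # coarsest scale has no next scale
    return fused

feats = [torch.randn(1, 32, 64, 64), torch.randn(1, 32, 32, 32),
         torch.randn(1, 32, 16, 16)]
fused = fuse_with_next_scale(feats)
```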
In some possible implementations, the performance of the decoder of the rough erasure sub-network can be improved by interactively fusing image features of different resolutions, improving the erasing effect of the obtained rough erased image.
As shown in the right half of FIG. 4, in the decoder of the coarse erasure sub-network, during upsampling the output of each network layer is obtained not only from the output of its corresponding upper network layer but also from the downsampled result of the output of its corresponding lower network layer.
FIG. 6 shows a flow diagram of one embodiment of acquiring a rough erased image using the network shown in FIG. 4. As shown in FIG. 6, this may include step S610, step S620, step S630 and step S640.
In step S610, the image features are input into the decoder and upsampled to obtain scale image features of different scales;
in step S620, downsampling the scale image features of the next scale corresponding to the scale image features to obtain downsampled image features;
in step S630, final image features corresponding to the scale image features are obtained according to the scale image features and the downsampled image features;
in step S640, a rough erasure image corresponding to the document image to be processed is acquired based on the final image characteristics.
In some possible implementations, in step S610, the image features of the document image to be processed acquired using the network shown in FIG. 4 are input into the decoder of the coarse erasure sub-network, and scale image features of different scales are obtained by upsampling or deconvolving the image features.
In some possible implementations, as shown in FIG. 4, the scale image features of each scale are obtained according to the image features of the document image to be processed of the same scale output by the feature extraction sub-network shown in FIG. 4 (i.e., shown by the dashed lines in FIG. 4) and the upsampled result of the scale image features output by the corresponding upper decoder layer.
In some possible implementations, in step S620, for each scale image feature, the scale image feature of the next scale, i.e., the scale image feature output by the next network layer, is downsampled to obtain a downsampled image feature with the same resolution as the current scale image feature.
In some possible implementations, in step S630, final image features for the scale may be obtained by summing the scale image features and the downsampled image features.
In some possible implementations, in step S640, a rough erased image may be generated from the final image features corresponding to the plurality of scale image features; the embodiments of the present disclosure do not limit the method of generating the rough erased image from these final image features.
Because the downsampled image features are obtained according to the scale image features (namely the image features with different resolutions) output by the next network layer, the interactive fusion between the image features with different resolutions can be realized by adding the scale image features and the downsampled image features, so that the image features are kept as lossless as possible.
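The mirrored downsample-and-add fusion of steps S610 to S640 can be sketched the same way; the coarse-to-fine ordering and the use of adaptive average pooling for downsampling are assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_with_downsampled_next(scale_feats):
    """scale_feats: feature maps ordered coarse -> fine, as produced during
    upsampling; each map is summed with the next (finer) map pooled down to
    its own spatial size. Equal channel counts are assumed."""
    fused = []
    for i, f in enumerate(scale_feats):
        if i + 1 < len(scale_feats):
            down = F.adaptive_avg_pool2d(scale_feats[i + 1], f.shape[-2:])
            fused.append(f + down)  # scale feature + downsampled next scale
        else:
            fused.append(f)         # finest scale has no next scale to pool
    return fused

feats = [torch.randn(1, 32, 16, 16), torch.randn(1, 32, 32, 32),
         torch.randn(1, 32, 64, 64)]
fused = fuse_with_downsampled_next(feats)
```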
In some possible implementations, the performance of the encoder and decoder of the fine erasure sub-network can also be improved by performing interactive fusion between image features with different resolutions, so that the erasure effect of the obtained final erasure image is improved.
As shown in the left half of FIG. 4, in the encoder of the fine erasure sub-network, during downsampling the output of each network layer is obtained not only by downsampling the output of its corresponding upper network layer but also from the upsampled output of its corresponding lower network layer.
As shown in the right half of FIG. 4, in the decoder of the fine erasure sub-network, during upsampling the output of each network layer is obtained not only from the output of its corresponding upper network layer but also from the downsampled output of its corresponding lower network layer.
FIG. 7 shows a flow diagram of one embodiment of acquiring a final erased image using the network shown in FIG. 4. As shown in FIG. 7, this may include step S710, step S720, step S730, step S740, step S750, step S760 and step S770.
In step S710, first scale image features of different scales corresponding to the rough erased image are obtained by inputting the rough erased image into the encoder;
in step S720, up-sampling is performed on the first scale image feature of the next scale corresponding to the first scale image feature to obtain a first up-sampled image feature;
in step S730, image features corresponding to the rough erasure image are obtained according to the first scale image features and the first upsampled image features;
in step S740, the image features corresponding to the rough erased image are input into the decoder and upsampled to obtain second scale image features of different scales;
in step S750, downsampling the second scale image features of the next scale corresponding to the second scale image features to obtain second downsampled image features;
in step S760, final image features corresponding to the second scale image features are obtained according to the second scale image features and the second downsampled image features;
In step S770, a final erased image corresponding to the document image to be processed is obtained by the final image feature.
In some possible implementations, in step S710, feature extraction is performed on the rough erased image by the encoder, and first scale image features of different scales output by different network layers of the encoder are obtained. The deeper a network layer of the encoder, the lower the resolution of the image features it outputs, and the smaller their scale.
In some possible implementations, in step S720, for each first scale image feature, the first scale image feature of the next scale, i.e., the one output by the next encoder network layer, is upsampled to obtain a first upsampled image feature with the same resolution as the current first scale image feature.
In some possible implementations, in step S730, the image features input to the decoder may be obtained by adding the first scale image features and the first upsampled image features.
In some possible implementations, in step S740, the image features of the rough erased image acquired using the encoder shown in FIG. 4 are input into the decoder of the fine erasure sub-network, and second scale image features of different scales are obtained by upsampling or deconvolving those image features.
In some possible implementations, as shown in FIG. 4, the second scale image features of each scale are obtained from the same-scale image features output by the encoder shown in FIG. 4 (i.e., shown by the dashed lines in FIG. 4) and the upsampled result of the second scale image features output by the corresponding upper decoder layer.
In some possible implementations, in step S750, for each second scale image feature, the second scale image feature of the next scale, i.e., the one output by the next decoder network layer, is downsampled to obtain a second downsampled image feature with the same resolution as the current second scale image feature.
In some possible implementations, in step S760, the final image feature corresponding to the scale may be obtained by adding the second scale image feature and the second downsampled image feature.
In some possible implementations, in step S770, a final erased image may be generated from the final image features of multiple scales; the embodiments of the present disclosure do not limit the method of generating the final erased image from them.
Because the first up-sampling image features are obtained according to the scale image features (namely the image features with different resolutions) output by the next network layer, the first scale image features and the first up-sampling image features can be added to realize interactive fusion between the image features with different resolutions, so that the image features are kept as lossless as possible.
Meanwhile, the second downsampled image features are likewise obtained according to the scale image features output by the next network layer (that is, image features of a different resolution), so adding the second scale image features and the second downsampled image features realizes interactive fusion between image features of different resolutions and keeps the image features as lossless as possible.
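As an illustration of the upsample-and-add and downsample-and-add fusion described in steps S710 to S770, the following PyTorch sketch shows how a feature at one scale can be fused with the feature of its corresponding next scale. The function names and tensor shapes are illustrative assumptions, not taken from the present disclosure; in a full network, a 1×1 convolution would typically align channel counts before the addition.

```python
import torch
import torch.nn.functional as F

def fuse_up(scale_feat, next_scale_feat):
    # First scale image feature + upsampled next-scale (lower-resolution)
    # feature: element-wise addition fuses the two resolutions (steps S720/S730).
    up = F.interpolate(next_scale_feat, size=scale_feat.shape[-2:],
                       mode="bilinear", align_corners=False)
    return scale_feat + up

def fuse_down(scale_feat, next_scale_feat):
    # Second scale image feature + downsampled next-scale (higher-resolution)
    # feature (steps S750/S760).
    down = F.adaptive_avg_pool2d(next_scale_feat, scale_feat.shape[-2:])
    return scale_feat + down

# Toy example: 1/8-scale and 1/16-scale feature maps with matching channels.
f8 = torch.randn(1, 64, 32, 32)
f16 = torch.randn(1, 64, 16, 16)
fused = fuse_up(f8, f16)      # -> (1, 64, 32, 32)
refined = fuse_down(f16, f8)  # -> (1, 64, 16, 16)
```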
Fig. 8 shows a flowchart of a training method of an erasing network according to an embodiment of the present disclosure. As shown in fig. 8, the training method of the erasing network according to the embodiment of the present disclosure may include step S810 and step S820.
In step S810, a sample document image pair is acquired, where the sample document image pair includes a sample document image and a tag image corresponding to the sample document image, in which the first format text is erased;
in step S820, training the erasure network model according to the sample document image pair to obtain an erasure network;
The erasing network, the segmentation head network and the discriminator network form an erasing network model; the erasing network is used for erasing the first format text in the sample document image input into the erasing network to obtain a final erased image corresponding to the sample document image; the segmentation head network is used for acquiring a mask corresponding to the first format text in the sample document image; and the discriminator network is used for judging whether the first format text in the final erased image has been erased;
the erasing network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
In some possible implementations, in step S810, the sample document image may be an image of a document having text in a first format; the sample document image may be a document image having at least a first format text and a second format text.
That is, the sample document image may be an image of a document having only one format text; or an image of a document having text in a plurality of formats.
In some specific implementations, the first format text can be handwritten text and the second format text can be printed text. The sample document image may be an image of a test paper with handwriting (e.g., a test paper that has been answered) or an image of an office document with handwriting (e.g., a document with a handwritten signature).
In some possible implementations, in step S820, the erasure network may include a feature extraction sub-network, a coarse erasure sub-network, a fine erasure sub-network.
The characteristic extraction sub-network is used for extracting characteristics of the sample document image and acquiring image characteristics of the document image to be processed.
The feature extraction sub-network may be any network having image feature extraction capabilities, such as ResNet, VGGNet. The deeper the hierarchy of the feature extraction subnetwork, the lower the resolution of the output image features.
In the process of extracting image features by the feature extraction sub-network, the deeper the hierarchy of the feature extraction sub-network, the lower the resolution of the extracted image features. Although low-resolution image features can be restored to high-resolution image features by upsampling or deconvolution, the process of changing the resolution from high to low loses information, such as spatial information.
In some possible implementation manners, the problem of spatial information loss caused by the fact that the resolution of the image features is reduced as the depth of the network is increased in the feature extraction sub-network can be avoided by performing parallel connection between the image features with different resolutions.
In some possible implementation manners, the image features with different resolutions can be interactively fused, so that the problem of spatial information loss caused by the fact that the resolution of the image features is reduced as the depth of the network is increased in the feature extraction sub-network is avoided.
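As a minimal sketch of such a feature extraction sub-network, the following code pulls multi-scale features from a ResNet-18 backbone; any network with image feature extraction capability would do, and the choice of backbone and tap points is an assumption for illustration only. Layer names follow torchvision.

```python
import torch
import torchvision.models as models

backbone = models.resnet18(weights=None)  # load pre-trained weights in practice

def extract_multi_scale(x):
    """Return feature maps of decreasing resolution (increasing depth)."""
    x = backbone.conv1(x); x = backbone.bn1(x)
    x = backbone.relu(x);  x = backbone.maxpool(x)
    c1 = backbone.layer1(x)   # 1/4 resolution
    c2 = backbone.layer2(c1)  # 1/8
    c3 = backbone.layer3(c2)  # 1/16
    c4 = backbone.layer4(c3)  # 1/32: deeper layer -> lower resolution
    return [c1, c2, c3, c4]

feats = extract_multi_scale(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])
```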
The rough erasing sub-network is used for erasing the first format text in the sample document image according to the image features of the sample document image, and obtaining a rough erasure image corresponding to the sample document image.
In some possible implementations, the rough erasure sub-network restores the features output by the feature extraction sub-network to the high resolution representation by up-sampling or deconvolution, generates a feature map of the same size as the sample document image, and generates a rough erasure image that erases the first format text in the sample document image.
That is, the rough erasing sub-network is configured to rough erase text in a specific format (e.g., a first format) in an image corresponding to the image feature according to the image feature output by the feature extraction sub-network, and generate an erased image.
In some possible implementations, the coarse erasure subnetwork can be an encoder corresponding to the feature extraction subnetwork.
In some possible implementation manners, the performance of the encoder can be improved by parallel connection between image features with different resolutions, so that the erasure effect of the obtained rough erasure image is improved.
In some possible implementation manners, the performance of the encoder can be improved by performing interactive fusion between image features with different resolutions, and the erasure effect of the obtained rough erasure image is improved.
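A minimal sketch of a coarse erasure head of this kind is shown below: stacked deconvolutions restore the deepest feature map to the input resolution and predict a three-channel erased image. The channel sizes and number of stages are assumptions for illustration, not the concrete layers of the present disclosure.

```python
import torch
import torch.nn as nn

class CoarseEraseHead(nn.Module):
    """Restores deep features to input resolution and predicts an erased image."""
    def __init__(self, in_ch=512):
        super().__init__()
        chans = [in_ch, 256, 128, 64, 32]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            # Each deconvolution doubles the spatial resolution.
            layers += [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*layers)          # 16x upsampling in total
        self.to_rgb = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, feat):
        return torch.sigmoid(self.to_rgb(self.up(feat)))

head = CoarseEraseHead()
coarse = head(torch.randn(1, 512, 16, 16))  # -> rough erasure image (1, 3, 256, 256)
```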
The fine erasing sub-network is used for erasing the first format characters in the rough erasing image according to the rough erasing image corresponding to the sample document image, and obtaining a final erasing image corresponding to the sample document image.
In some possible implementations, the fine erasure sub-network may include a decoder and an encoder.
The decoder is used for extracting the characteristics of the rough erasure image input into the decoder, and the encoder is used for upsampling or deconvoluting the characteristics extracted by the decoder to generate a final erasure image for erasing the characters in the first format in the rough erasure image.
That is, the fine erasure sub-network performs fine erasure on the basis of performing the first format text rough erasure on the sample document image by the rough erasure sub-network, so as to generate a final erasure image with better erasure effect.
In some possible implementation manners, the performance of the fine erasure sub-network can be improved by parallel connection between image features with different resolutions, and the erasure effect of the obtained final erasure image is improved.
In some possible implementation manners, the performance of the fine erasure sub-network can be improved by performing interactive fusion between image features with different resolutions, and the erasure effect of the obtained final erasure image is improved.
The erasure network is an integral part of the erasure network model.
Specifically, the erasure network composed of the feature extraction sub-network, the coarse erasure sub-network and the fine erasure sub-network, together with the Segmentation Head and the discriminator network, forms EraseNet.
Specifically, two branches are connected after the feature extraction sub-network. One branch is the supervision branch, namely the Segmentation Head; its main role is to determine the Mask of the first format text and to use the Mask to supervise and constrain the training of the erasure branch. The other branch is the erasure branch, which includes the coarse erasure sub-network, the fine erasure sub-network and the discriminator network connected in order. The discriminator network is used for judging whether the first format text in the final erased image generated by the fine erasure sub-network has been erased.
EraseNet is trained using sample document image pairs, i.e., a sample document image I_Input, a tag image I_gt obtained by erasing the first format text from the sample document image I_Input, and a mask I_Mask of the first format text in the sample document image I_Input, to obtain a trained erasing network, or, equivalently, a trained feature extraction sub-network, coarse erasure sub-network and fine erasure sub-network.
The training process of EraseNet may be as follows: the sample document image is input into the feature extraction sub-network to obtain the image features corresponding to the sample document image; the obtained image features are input into the Segmentation Head, which outputs the Mask of the first format text in the sample image; the obtained image features are also input into the coarse erasure sub-network to obtain the rough erasure image I_Cout output by the coarse erasure sub-network; I_Cout is input into the fine erasure sub-network to obtain the final erasure image I_Rout output by the fine erasure sub-network; and I_Rout is input into the discriminator network to obtain the output of the discriminator network.
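The forward pass of this training process can be sketched as follows; all sub-network classes are stand-ins with assumed interfaces, not the concrete layers of the present disclosure.

```python
import torch
import torch.nn as nn

class EraseNet(nn.Module):
    """Two-branch structure: supervision branch (seg head) + erasure branch."""
    def __init__(self, feature_net, seg_head, coarse_net, fine_net):
        super().__init__()
        self.feature_net = feature_net
        self.seg_head = seg_head
        self.coarse_net = coarse_net
        self.fine_net = fine_net

    def forward(self, img):
        feats = self.feature_net(img)     # image features of the sample image
        mask_pred = self.seg_head(feats)  # Mask of the first format text
        i_cout = self.coarse_net(feats)   # rough erasure image I_Cout
        i_rout = self.fine_net(i_cout)    # final erasure image I_Rout
        return mask_pred, i_cout, i_rout

# Toy stand-ins so the wiring above can be exercised end to end:
net = EraseNet(
    feature_net=nn.Conv2d(3, 16, 3, padding=1),
    seg_head=nn.Conv2d(16, 1, 1),
    coarse_net=nn.Conv2d(16, 3, 1),
    fine_net=nn.Conv2d(3, 3, 1),
)
mask, i_cout, i_rout = net(torch.randn(1, 3, 64, 64))
# During training, I_Rout is further fed to a discriminator network that
# judges whether the first format text has been erased.
```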
The loss function of EraseNet comprises multiple components. In some possible implementations, the loss function of EraseNet comprises at least a style loss function L_S and a content loss function L_Perc.
The style loss function L_S and the content loss function L_Perc can be obtained by the following formulas:

$$L_{Perc}=\sum_{n=1}^{N}\frac{1}{H_{n}W_{n}C_{n}}\left\|\phi_{n}(I_{i})-\phi_{n}(I_{gt})\right\|_{1}$$

$$L_{S}=\sum_{n=1}^{N}\frac{1}{C_{n}\times C_{n}}\left\|\frac{\phi_{n}(I_{i})^{\top}\phi_{n}(I_{i})-\phi_{n}(I_{gt})^{\top}\phi_{n}(I_{gt})}{H_{n}W_{n}C_{n}}\right\|_{1}$$

where \(\phi_{n}(I_{Rout})\), \(\phi_{n}(I_{gt})\) and \(\phi_{n}(I_{Com})\) are the feature maps output by the n-th layer of a pre-trained image feature extraction network (such as a VGG-16 network) when I_Rout, I_gt and I_Com are respectively input; H_n, W_n and C_n respectively represent the height, width and number of channels of the n-th layer of the image feature extraction network; N is the total number of layers of the image feature extraction network; and I_i ranges over I_Rout and I_Com.
I_Com is the mixed image feature, which is determined according to the pixels corresponding to the first format text in the sample document image and the other pixels, except those corresponding to the first format text, in the image output by the erasing network.
Specifically, I_Com can be determined according to the following formula:

I_Com = I_Mask * I_gt + (1 − I_Mask) * I_Rout
Since I_Mask is 1 at the positions of the first format text and 0 elsewhere, (1 − I_Mask) is 0 at the positions of the first format text and 1 elsewhere. I_Mask * I_gt therefore takes the pixels corresponding to the first format text, and (1 − I_Mask) * I_Rout takes the other pixels, except those corresponding to the first format text, in the image output by the fine erasure sub-network of the erasure network. Computing I_Com according to the above formula and calculating the content loss function and the style loss function through I_Com allows the regions other than the first format text to be supervised by these two losses, realizing decoupling of the supervision branch and the erasure branch and improving the erasure effect of the obtained erasure network.
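A hedged PyTorch sketch of the blended image I_Com and the content and style losses built from it is given below; the VGG-16 tap points, the untrained weights, and the equal weighting of the terms are assumptions for illustration (in practice a pre-trained VGG-16 would be loaded).

```python
import torch
import torchvision.models as models

# Frozen VGG-16 feature extractor; load pre-trained weights in practice.
vgg = models.vgg16(weights=None).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
LAYERS = (3, 8, 15)  # relu1_2, relu2_2, relu3_3 in torchvision's vgg16 (assumed taps)

def vgg_feats(x):
    feats, h = [], x
    for idx, layer in enumerate(vgg):
        h = layer(h)
        if idx in LAYERS:
            feats.append(h)
    return feats

def gram(f):
    # Gram matrix of a (B, C, H, W) feature map, normalized by its size.
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def content_and_style_loss(i_rout, i_gt, i_mask):
    # Blended image: label pixels at first-format-text positions,
    # network output pixels everywhere else.
    i_com = i_mask * i_gt + (1 - i_mask) * i_rout
    feats_gt = vgg_feats(i_gt)
    l_perc = l_style = 0.0
    for i_i in (i_rout, i_com):  # I_i ranges over I_Rout and I_Com
        for f_i, f_gt in zip(vgg_feats(i_i), feats_gt):
            l_perc += torch.abs(f_i - f_gt).mean()               # content loss
            l_style += torch.abs(gram(f_i) - gram(f_gt)).mean()  # style loss
    return l_perc, l_style

# i_mask: 1 at first-format-text positions, 0 elsewhere (broadcast over RGB).
i_rout = torch.rand(1, 3, 64, 64, requires_grad=True)
i_gt = torch.rand(1, 3, 64, 64)
i_mask = (torch.rand(1, 1, 64, 64) > 0.9).float()
l_perc, l_style = content_and_style_loss(i_rout, i_gt, i_mask)
```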
In some possible implementations, the loss function of EraseNet also includes a Local-aware Reconstruction Loss L_LR, the Dice Loss of the supervision branch, and the GAN Loss of the erasure network as generator and the discriminator network.
In some possible implementations, where the label image corresponding to the sample document image used to train EraseNet is an image in which the first format text has been erased and a shadow removal operation has been performed on the background portion other than the text portion (i.e., the background other than the text is whitened), the trained erasing network can not only erase text from the sample document image but also perform the shadow removal operation on it, generating a final erased image in which the first format text is erased and the background is whitened.
In the training method of the erasing network provided by the embodiments of the present disclosure, the first format text in the sample document image is erased through an end-to-end erasing network; meanwhile, the mixed image features are determined using the pixels corresponding to the first format text and the other pixels outside them, so that the segmentation head network and the erasing network are decoupled and each performs its own role, improving the accuracy of the final erased image output by the trained erasing network.
As described above, in some possible implementations, the tag image is an image obtained from the sample document image by erasing the first format text and performing the shadow removal operation on the background portion other than the text portion of the sample document image.
Under the condition that the erasing network comprises a feature extraction sub-network, a rough erasing sub-network and a fine erasing sub-network, the rough erasing sub-network is used for erasing first format characters in a sample document image according to image features of the sample document image input into the rough erasing sub-network, and performing shadow removing operation on a background part except for a character part in the sample document image to obtain a rough erasing image corresponding to the sample document image; the fine erasing sub-network is used for erasing characters in a first format in the coarse erasing image according to the coarse erasing image input into the fine erasing sub-network, and performing shadow removing operation on a background part except for the character part in the coarse erasing image to obtain a final erasing image corresponding to the sample document image.
That is, in the case that the label image corresponding to the sample document image used to train EraseNet is an image obtained by erasing the first format text and performing a shadow removal operation on the background portion other than the text portion (text of all formats) in the sample document image, the feature extraction sub-network, the coarse erasure sub-network and the fine erasure sub-network obtained by training can not only erase text from the sample document image but also perform the shadow removal operation on it, generating a final erased image from which the first format text is erased and in which the background is whitened.
That is, the end-to-end network model obtained by the training method for erasing a network according to the embodiment of the present disclosure may perform the shadow removing operation on the background portion of the sample document image while erasing the first format text in the sample document image.
The structures of the coarse erasure sub-network and the fine erasure sub-network are as described above, and the coarse erasure sub-network and the fine erasure sub-network likewise constitute EraseNet together with the feature extraction sub-network, the segmentation head network and the discriminator network. The specific composition and training process of EraseNet are also as described above.
That is, I_Com is likewise determined according to the pixels corresponding to the first format text in the sample document image and the other pixels, except those corresponding to the first format text, in the image output by the fine erasure sub-network. Calculating the content loss function and the style loss function through I_Com allows the regions other than the first format text to be supervised by these losses, avoids interference from those regions with the background portion during training, realizes decoupling of the supervision branch and the erasure branch, and improves the erasure and shadow removal effects of the obtained feature extraction sub-network, coarse erasure sub-network and fine erasure sub-network.
As described above, in some possible implementations, the image features with different resolutions may be interactively fused, so as to avoid the problem of spatial information loss caused by the decrease of the resolution of the image features, which is caused by the deepening of the network depth and is present in the feature extraction sub-network.
Fig. 4 shows a schematic diagram of a network for interactive fusion between image features of different resolutions on the basis of a feature extraction sub-network.
As shown in the left half of fig. 4, during the downsampling of the network, the output of the network layer is obtained not only by downsampling of the output of its corresponding upper network layer, but also by the result of the upsampling of its corresponding lower network layer.
That is, the feature extraction sub-network is used to: extracting image features of the sample document image to obtain scale image features of different scales corresponding to the sample document image; up-sampling the scale image features of the next scale corresponding to the scale image features to obtain up-sampled image features; and acquiring the image characteristics of the sample document image according to the scale image characteristics and the up-sampling image characteristics.
Specifically, feature extraction is carried out on the sample document image through the feature extraction sub-network, and the scale image features with different scales output by different network layers of the feature extraction sub-network are obtained. The deeper the network layer of the feature extraction sub-network layer, the lower the resolution of the image features it outputs, and the smaller the scale.
For each scale image feature, the corresponding scale image feature of the next scale, namely the scale image feature output by the next network layer, and up-sampling the scale image feature output by the next network layer can obtain the up-sampling image feature with the same resolution as the scale image feature.
The scale image features of the input coarse erasure sub-network may be obtained by summing the scale image features and the up-sampled image features.
Because the up-sampling image features are obtained according to the scale image features (namely the image features with different resolutions) output by the next network layer, the interactive fusion between the image features with different resolutions can be realized by adding the scale image features and the up-sampling image features, so that the image features are kept as lossless as possible.
In some possible implementation manners, the performance of the encoder of the rough erasure sub-network can be improved by performing interactive fusion between image features with different resolutions, so that the erasure effect of the obtained rough erasure image is improved.
As shown in the right half of fig. 4, during the upsampling of the network, the output of a network layer of the encoder of the coarse erasure sub-network is obtained not only from the upsampled output of its corresponding upper network layer, but also from the downsampled output of its corresponding lower network layer.
That is, in some possible implementations, the coarse erasure sub-network includes an encoder corresponding to the feature extraction sub-network; the encoder is used for: up-sampling the image features to obtain scale image features with different scales; downsampling the scale image features of the next scale corresponding to the scale image features to obtain downsampled image features; acquiring final image features corresponding to the scale image features according to the scale image features and the downsampled image features; and acquiring a rough erasure image corresponding to the sample document image according to the final image characteristics.
Specifically, the image features of the sample document image acquired using the network shown in fig. 4 are input to the encoder of the coarse erasure sub-network, and the scale image features of different scales are acquired by upsampling or deconvoluting the image features.
As shown in fig. 4, the scale image features of each scale are obtained from the image features of the sample document image of the same scale output by the feature extraction sub-network shown in fig. 4 (i.e., shown by the dashed line in fig. 4) and the upsampling result of the scale image features output by the encoder layer of the corresponding upper layer.
For each scale image feature, the corresponding scale image feature of the next scale, namely the scale image feature output by the next network layer, and downsampled image features with the same resolution as the scale image feature can be obtained by downsampling the scale image feature output by the next network layer.
Final image features for the scale may be obtained by summing the scale image features and the downsampled image features.
The method of generating the rough erasure image according to the final image features corresponding to the plurality of scale image features is not limited in this disclosure.
Because the downsampled image features are obtained according to the scale image features (namely the image features with different resolutions) output by the next network layer, the interactive fusion between the image features with different resolutions can be realized by adding the scale image features and the downsampled image features, so that the image features are kept as lossless as possible.
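The wiring of the right half of fig. 4 can be sketched as follows: each encoder stage combines the upsampled output of the previous (coarser) stage with the same-scale skip feature from the extraction path (the dashed lines), and each scale is then refined by adding the downsampled next-scale feature. Shapes, channel counts and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def encoder_stage(prev_stage_out, skip_same_scale):
    # Upsample the previous (lower-resolution) encoder stage output and add
    # the same-scale feature from the extraction path (dashed line in fig. 4).
    up = F.interpolate(prev_stage_out, size=skip_same_scale.shape[-2:],
                       mode="bilinear", align_corners=False)
    return up + skip_same_scale

def refine_with_next_scale(scale_feat, next_scale_feat):
    # Final per-scale feature: add the downsampled next-scale (finer) feature.
    down = F.adaptive_avg_pool2d(next_scale_feat, scale_feat.shape[-2:])
    return scale_feat + down

# Same-scale skip features from the extraction path, coarse to fine.
skips = [torch.randn(1, 64, s, s) for s in (16, 32, 64)]
x, outs = skips[0], []
for skip in skips[1:]:
    x = encoder_stage(x, skip)
    outs.append(x)
# Refine the 32x32 output with its downsampled 64x64 neighbor.
final_32 = refine_with_next_scale(outs[0], outs[1])
```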
In some possible implementations, the performance of the encoder and decoder of the fine erasure sub-network can also be improved by performing interactive fusion between image features with different resolutions, so that the erasure effect of the obtained final erasure image is improved.
As shown in the left half of fig. 4, during the downsampling of the network, the output of a network layer of the decoder of the fine erasure sub-network is obtained not only from the downsampled output of its corresponding upper network layer, but also from the upsampled output of its corresponding lower network layer.
As shown in the right half of fig. 4, during the upsampling of the network, the output of a network layer of the encoder of the fine erasure sub-network is obtained not only from the upsampled output of its corresponding upper network layer, but also from the downsampled output of its corresponding lower network layer.
That is, in some possible implementations, the fine erasure sub-network includes a decoder and an encoder; the decoder is used for: the method comprises the steps of inputting a rough erasure image into a decoder to obtain first-scale image features of different scales corresponding to the rough erasure image; up-sampling the first scale image features of the next scale corresponding to the first scale image features to obtain first up-sampled image features; acquiring image features corresponding to the rough erasure image according to the first scale image features and the first up-sampling image features; the encoder is used for: up-sampling the image features corresponding to the rough erasure image to obtain second-scale image features with different scales; downsampling the second scale image features of the next scale corresponding to the second scale image features to obtain second downsampled image features; acquiring final image features corresponding to the second scale image features according to the second scale image features and the second downsampled image features; and acquiring a final erasing image corresponding to the sample document image through the final image characteristics.
Specifically, the feature extraction is performed on the rough erasure image through the decoder, and the first scale image features of different scales output by different network layers of the decoder are obtained. The deeper the network layer of the decoder, the lower the resolution of the image features it outputs, and the smaller the scale.
For each first scale image feature, the corresponding next-scale image feature is the first scale image feature output by the next network layer of the decoder; by upsampling the first scale image feature output by the next network layer, a first upsampled image feature having the same resolution as the first scale image feature can be obtained.
Image features of the scale of the input encoder may be obtained by adding the first scale image features and the first upsampled image features.
The image features of the coarse erasure image acquired using the decoder shown in fig. 4 are input to the encoder of the fine erasure sub-network, and the second-scale image features of different scales are acquired by upsampling or deconvoluting the image features of the coarse erasure image.
As shown in fig. 4, the second scale image features of each scale are obtained from the same-scale image features output by the decoder shown in fig. 4 (i.e., the dashed lines in fig. 4) and the upsampled result of the second scale image features output by the corresponding upper network layer of the encoder.
For each second scale image feature, the corresponding next-scale image feature is the second scale image feature output by the next network layer of the encoder; by downsampling the second scale image feature output by the next encoder network layer, a second downsampled image feature having the same resolution as the second scale image feature can be obtained.
The final image feature corresponding to the scale may be obtained by adding the second scale image feature and the second downsampled image feature.
The method of generating the final erased image from the plurality of scale final image features is not limited in this disclosure.
Because the first up-sampling image features are obtained according to the scale image features (namely the image features with different resolutions) output by the next network layer, the first scale image features and the first up-sampling image features can be added to realize interactive fusion between the image features with different resolutions, so that the image features are kept as lossless as possible.
Meanwhile, the second downsampled image features are likewise obtained according to the scale image features output by the next network layer (that is, image features of a different resolution), so adding the second scale image features and the second downsampled image features realizes interactive fusion between image features of different resolutions and keeps the image features as lossless as possible.
Based on the same principle as the method shown in fig. 1, fig. 9 shows a schematic structural diagram of an apparatus for image processing provided by an embodiment of the present disclosure, and as shown in fig. 9, an apparatus 90 for image processing may include:
an image module 910, configured to obtain an image of a document to be processed;
the reasoning module 920 is configured to erase the first format text in the document image to be processed by inputting the document image to be processed into a pre-trained erasing network, so as to obtain a final erased image corresponding to the document image to be processed;
the erasing network, the segmentation head network and the discriminator network form an erasing network model, and the erasing network is obtained by training the erasing network model in advance using a sample document image pair; the sample document image pair comprises a sample document image and a label image corresponding to the sample document image, in which the first format text has been erased;
the erasing network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
In the image processing device provided by the embodiments of the present disclosure, the first format text in the document image to be processed is erased through an end-to-end erasing network; meanwhile, the mixed image features are determined using the pixels corresponding to the first format text and the other pixels outside them, so that the segmentation head network and the erasing network are decoupled and each performs its own role, improving the accuracy of the final erased image output by the trained erasing network.
It will be appreciated that the above-described modules of the apparatus for image processing in the embodiments of the present disclosure have functions of implementing the respective steps of the method for image processing in the embodiment shown in fig. 1. The functions can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented separately or may be implemented by integrating multiple modules. For the functional description of each module of the above image processing apparatus, reference may be specifically made to the corresponding description of the image processing method in the embodiment shown in fig. 1, which is not repeated herein.
Based on the same principle as the method shown in fig. 8, fig. 10 shows a schematic structural diagram of a training device for an erasing network according to an embodiment of the present disclosure, and as shown in fig. 10, the training device 100 for an erasing network may include:
a sample module 1010, configured to obtain a sample document image pair, where the sample document image pair includes a sample document image and a tag image corresponding to the sample document image, in which the first format text is erased;
the training module 1020 is configured to train the erasure network model according to the sample document image pair to obtain an erasure network;
the erasure network, the segmentation head network and the discriminator network form the erasure network model; the erasing network is used for erasing the first format text in the sample document image input into the erasing network to obtain a final erased image corresponding to the sample document image; the segmentation head network is used for acquiring a mask corresponding to the first format text in the sample document image; the discriminator network is used for judging whether the first format text in the final erased image has been erased;
the erasing network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
In the training device of the erasing network provided by the embodiments of the present disclosure, the first format text in the sample document image is erased through an end-to-end erasing network; meanwhile, the mixed image features are determined using the pixels corresponding to the first format text and the other pixels outside them, so that the segmentation head network and the erasing network are decoupled and each performs its own role, improving the accuracy of the final erased image output by the trained erasing network.
It will be appreciated that the above-described modules of the training apparatus for an erasure network in the embodiments of the present disclosure have the function of implementing the corresponding steps of the training method for an erasure network in the embodiment shown in fig. 8. The functions can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented separately or may be implemented by integrating multiple modules. For the functional description of each module of the training device for the erase network, reference may be made specifically to the corresponding description of the training method for the erase network in the embodiment shown in fig. 8, which is not repeated herein.
In the technical solution of the present disclosure, the acquisition, storage, application and the like of the user personal information involved all conform to the provisions of relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of image processing and the training methods of erasing a network as provided by embodiments of the present disclosure.
Compared with the prior art, the electronic device erases the first format text in the document image to be processed through an end-to-end erasure network; meanwhile, the mixed image features are determined using the pixels corresponding to the first format text and the other pixels outside them, so that the segmentation head network and the erasure network are decoupled and each performs its own role, improving the accuracy of the final erased image output by the trained erasure network.
The readable storage medium is a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method of image processing and a training method of erasing a network as provided by embodiments of the present disclosure.
Compared with the prior art, the readable storage medium realizes the erasure of the first format text in the document image to be processed through an end-to-end erasure network; meanwhile, the mixed image features are determined using the pixels corresponding to the first format text and the other pixels outside them, so that the segmentation head network and the erasure network are decoupled and each performs its own role, improving the accuracy of the final erased image output by the trained erasure network.
The computer program product comprises a computer program which, when executed by a processor, implements a method of image processing and a training method of erasing a network as provided by embodiments of the present disclosure.
Compared with the prior art, the computer program product realizes the erasure of the first format text in the document image to be processed through an end-to-end erasure network; meanwhile, the mixed image features are determined using the pixels corresponding to the first format text and the other pixels outside them, so that the segmentation head network and the erasure network are decoupled and each performs its own role, improving the accuracy of the final erased image output by the trained erasure network.
Fig. 11 illustrates a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the apparatus 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the device 1100 can also be stored. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Various components in device 1100 are connected to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, etc.; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, etc.; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the respective methods and processes described above, such as the method of image processing and the training method of the erasure network. For example, in some embodiments, the method of image processing and the training method of the erasure network may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1108. In some embodiments, some or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the above-described method of image processing and training method of the erasure network can be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of image processing and the training method of the erasure network.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A method of image processing, comprising:
acquiring a document image to be processed;
erasing first format text in a document image to be processed by inputting the document image to be processed into a pre-trained erasing network, to obtain a final erased image corresponding to the document image to be processed;
the erasing network, the segmentation head network and the discriminator network form an erasing network model, and the erasing network is obtained by training the erasing network model in advance using a sample document image pair; the sample document image pair comprises a sample document image and a label image corresponding to the sample document image, in which the first format text has been erased;
The erasure network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
2. The method of claim 1, wherein the label image is an image of the sample document image from which the first format text is erased and in which a background portion other than the text portion of the sample document image is subjected to a shadow removal operation;
the step of erasing the first format text in the document image to be processed by inputting the document to be processed into a pre-trained erasing network to obtain a final erasing image corresponding to the document image to be processed, comprising the following steps:
and inputting the document to be processed into a pre-trained erasing network, erasing characters in a first format in the document image to be processed, and performing shadow removing operation on a background part except for a character part in the document image to be processed to obtain a final erasing image corresponding to the document image to be processed.
3. The method of claim 1, wherein the erasure network comprises a feature extraction sub-network, a coarse erasure sub-network, a fine erasure sub-network;
the step of erasing the first format text in the document image to be processed by inputting the document to be processed into a pre-trained erasing network to obtain a final erasing image corresponding to the document image to be processed, comprising the following steps:
inputting the document image to be processed into a pre-trained feature extraction sub-network, and extracting features of the document image to be processed to obtain image features of the document image to be processed;
the image features are input into a pre-trained rough erasing sub-network, and first format characters in the document image to be processed are erased according to the image features, so that a rough erasing image corresponding to the document image to be processed is obtained;
and inputting the rough erasing image into a pre-trained fine erasing sub-network to erase the first format characters in the rough erasing image, so as to obtain a final erasing image corresponding to the document image to be processed.
4. A method according to claim 3, wherein the obtaining the image features of the document image to be processed by inputting the document image to be processed into a pre-trained feature extraction sub-network, performing feature extraction on the document image to be processed, comprises:
Inputting the document image to be processed into a pre-trained feature extraction sub-network, and extracting image features of the document image to be processed to obtain scale image features of different scales corresponding to the document image to be processed;
up-sampling the scale image features of the next scale corresponding to the scale image features to obtain up-sampled image features;
and acquiring the image characteristics of the document image to be processed according to the scale image characteristics and the up-sampling image characteristics.
5. The method of claim 4, wherein the coarse erasure sub-network includes an encoder corresponding to a feature extraction sub-network;
the step of obtaining the rough erasure image corresponding to the document image to be processed by inputting the image features into a rough erasure sub-network trained in advance and erasing the first format characters in the document image to be processed according to the image features comprises the following steps:
inputting the image features into the encoder, and up-sampling the image features to obtain scale image features with different scales;
downsampling the scale image features of the next scale corresponding to the scale image features to obtain downsampled image features;
Acquiring final image features corresponding to the scale image features according to the scale image features and the downsampled image features;
and acquiring a rough erasure image corresponding to the document image to be processed according to the final image characteristics.
6. The method of claim 3, wherein the fine erasure sub-network comprises a decoder and an encoder;
the erasing the first format text in the rough erasing image by inputting the rough erasing image into a pre-trained fine erasing sub-network, and obtaining a final erasing image corresponding to the document image to be processed, which comprises the following steps:
the rough erasure image is input into the decoder, so that first scale image features of different scales corresponding to the rough erasure image are obtained;
up-sampling the first scale image features of the next scale corresponding to the first scale image features to obtain first up-sampled image features;
acquiring image features corresponding to the rough erasure image according to the first scale image features and the first up-sampling image features;
inputting the image features corresponding to the rough erasure image into the encoder, and up-sampling the image features corresponding to the rough erasure image to obtain second-scale image features with different scales;
Downsampling the second scale image features of the next scale corresponding to the second scale image features to obtain second downsampled image features;
acquiring final image features corresponding to the second scale image features according to the second scale image features and the second downsampled image features;
and acquiring a final erasing image corresponding to the document image to be processed through the final image characteristics.
7. The method of claim 1, wherein the document image to be processed is an image of a document having at least a first format text and a second format text.
8. The method of claim 7, wherein the first formatted text is handwritten formatted text; the second format text is printed body format text.
9. A training method of an erasure network, comprising:
acquiring a sample document image pair, wherein the sample document image pair comprises a sample document image and a label image corresponding to the sample document image, in which the characters of a first format are erased;
training an erasure network model according to the sample document image pair to obtain an erasure network;
the erasure network, the segmentation head network and the discriminator network form the erasure network model; the erasing network is used for erasing the first format characters in the sample document image input into the erasing network, and obtaining a final erasing image corresponding to the sample document image; the segmentation head network is used for acquiring a mask corresponding to the first format text in the sample document image; the discriminator network is used for judging whether the first format characters in the final erased image are erased or not;
The erasure network model is trained at least according to the content loss function and the style loss function; the content loss function and the style loss function are obtained according to the mixed image characteristics of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and other pixels except the pixels corresponding to the first format characters in the image output by the erasing network.
10. The method of claim 9, wherein the label image is an image of the sample document image from which the first format text is erased and in which a background portion other than the text portion is subjected to a shadow removal operation;
training the erasing network model according to the sample document image pair to obtain an erasing network, including:
inputting the sample document image into the erasing network, erasing characters in a first format in the sample document image, and performing shadow removing operation on a background part except for the character part in the sample document image to obtain a final erasing image corresponding to the sample document image;
and training the erasure network model according to the final erasure image and the label image to obtain an erasure network.
11. The method of claim 9, wherein the erasure network comprises a feature extraction sub-network, a coarse erasure sub-network, and a fine erasure sub-network;
the feature extraction sub-network is used for extracting features of a sample document image input into the feature extraction sub-network, and obtaining image features of the sample document image;
the coarse erasure sub-network is used for erasing the first format characters in the sample document image according to the image features of the sample document image input into the coarse erasure sub-network, and obtaining a rough erasure image corresponding to the sample document image;
the fine erasure sub-network is used for erasing the first format characters in the rough erasure image according to the rough erasure image input into the fine erasure sub-network, and obtaining a final erasing image corresponding to the sample document image.
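The three sub-networks chain into a single forward pass. A sketch of the composition, assuming each sub-network is an ordinary torch.nn.Module with the input/output contract stated in claim 11:

    import torch.nn as nn

    class ErasureNetworkSketch(nn.Module):
        # Claim 11 pipeline: image features -> rough erasure image -> final erased image.
        def __init__(self, feature_net, coarse_net, fine_net):
            super().__init__()
            self.feature_net = feature_net  # document image -> image features
            self.coarse_net = coarse_net    # image features -> rough erasure image
            self.fine_net = fine_net        # rough erasure image -> final erased image

        def forward(self, document_image):
            features = self.feature_net(document_image)
            rough_image = self.coarse_net(features)
            return self.fine_net(rough_image)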
12. The method of claim 11, wherein the feature extraction sub-network is used for:
extracting image features of the sample document image to obtain scale image features of different scales corresponding to the sample document image;
up-sampling the scale image features of the next scale corresponding to the scale image features to obtain up-sampled image features;
and acquiring the image features of the sample document image according to the scale image features and the up-sampled image features.
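This top-down fusion resembles a feature pyramid: each scale is enriched with an up-sampled copy of the next, deeper scale. A sketch with three scales; the backbone, depth, and channel widths are assumptions:

    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureExtractionSketch(nn.Module):
        def __init__(self, ch=32):
            super().__init__()
            self.c1 = nn.Conv2d(3, ch, 3, stride=2, padding=1)   # 1/2 scale
            self.c2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # 1/4 scale
            self.c3 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # 1/8 scale

        def forward(self, image):
            # Scale image features of different scales.
            s1 = F.relu(self.c1(image))
            s2 = F.relu(self.c2(s1))
            s3 = F.relu(self.c3(s2))
            # Fuse each scale with the up-sampled features of the next scale.
            s2 = s2 + F.interpolate(s3, size=s2.shape[-2:], mode="bilinear", align_corners=False)
            s1 = s1 + F.interpolate(s2, size=s1.shape[-2:], mode="bilinear", align_corners=False)
            return s1  # image features of the sample document image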
13. The method of claim 12, wherein the coarse erasure sub-network includes an encoder corresponding to the feature extraction sub-network;
the encoder is used for:
up-sampling the image features to obtain scale image features of different scales;
downsampling the scale image features of the next scale corresponding to the scale image features to obtain downsampled image features;
acquiring final image features corresponding to the scale image features according to the scale image features and the downsampled image features;
and acquiring a rough erasure image corresponding to the sample document image according to the final image characteristics.
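The encoder runs the opposite, bottom-up fusion: the incoming image features are up-sampled into several scales, each scale is combined with a down-sampled copy of the next one, and the rough erasure image is predicted from the result. Again a sketch only; depth, widths, and interpolation mode are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoarseErasureSketch(nn.Module):
        def __init__(self, ch=32):
            super().__init__()
            self.e1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.e2 = nn.Conv2d(ch, ch, 3, padding=1)
            self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

        def forward(self, image_features):
            # Up-sample the image features into scale image features of two scales.
            u1 = F.relu(self.e1(F.interpolate(image_features, scale_factor=2.0, mode="bilinear", align_corners=False)))
            u2 = F.relu(self.e2(F.interpolate(u1, scale_factor=2.0, mode="bilinear", align_corners=False)))
            # Fuse with the down-sampled features of the next scale.
            final_features = u1 + F.interpolate(u2, size=u1.shape[-2:], mode="bilinear", align_corners=False)
            # Rough erasure image from the final image features.
            return torch.sigmoid(self.to_rgb(final_features))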
14. The method of claim 11, wherein the fine erasure sub-network comprises a decoder and an encoder;
the decoder is used for:
the rough erasure image is input into the decoder, so that first scale image features of different scales corresponding to the rough erasure image are obtained;
up-sampling the first scale image features of the next scale corresponding to the first scale image features to obtain first up-sampled image features;
acquiring image features corresponding to the rough erasure image according to the first scale image features and the first up-sampled image features;
the encoder is used for:
up-sampling the image features corresponding to the rough erasure image to obtain second scale image features of different scales;
downsampling the second scale image features of the next scale corresponding to the second scale image features to obtain second downsampled image features;
acquiring final image features corresponding to the second scale image features according to the second scale image features and the second downsampled image features;
and acquiring a final erasing image corresponding to the sample document image according to the final image features.
15. The method of claim 9, wherein the sample document image is an image of a document containing at least text in a first format and text in a second format.
16. An apparatus for image processing, comprising:
the image module is used for acquiring a document image to be processed;
the inference module is used for erasing the first format characters in the document image to be processed by inputting the document image to be processed into a pre-trained erasing network, so as to obtain a final erasing image corresponding to the document image to be processed;
the erasing network, the segmentation head network and the discriminator network form an erasing network model, and the erasing network is obtained by training the erasing network model in advance using a sample document image pair; the sample document image pair comprises a sample document image and a corresponding label image in which the first format characters have been erased;
the erasure network model is trained according to at least a content loss function and a style loss function; the content loss function and the style loss function are obtained according to mixed image features of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and the other pixels, except the pixels corresponding to the first format characters, in the image output by the erasing network.
17. A training apparatus for erasing a network, comprising:
the sample module is used for acquiring a sample document image pair, wherein the sample document image pair comprises a sample document image and a corresponding label image in which the first format characters have been erased;
the training module is used for training the erasing network model according to the sample document image pair to acquire an erasing network;
the erasure network, the segmentation head network and the discriminator network form the erasure network model; the erasing network is used for erasing the first format characters in the sample document image input into the erasing network, and obtaining a final erasing image corresponding to the sample document image; the segmentation head network is used for acquiring a mask corresponding to the first format text in the sample document image; the discriminator network is used for judging whether the first format characters in the final erased image are erased or not;
the erasure network model is trained according to at least a content loss function and a style loss function; the content loss function and the style loss function are obtained according to mixed image features of the sample document image and the label image; and the mixed image features are obtained according to the pixels corresponding to the first format characters in the sample document image and the other pixels, except the pixels corresponding to the first format characters, in the image output by the erasing network.
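For orientation, the training-time composition described in this claim might be wired as below. The claims do not specify whether the segmentation head consumes the erased image or intermediate features, so the sketch simply feeds it the erasure output; only the erasure network is kept for inference.

    import torch.nn as nn

    class ErasureModelSketch(nn.Module):
        # Erasure network model: erasure network + segmentation head + discriminator.
        def __init__(self, erasure_net, seg_head, discriminator):
            super().__init__()
            self.erasure_net = erasure_net      # sample image -> final erased image
            self.seg_head = seg_head            # -> mask of the first format text (assumed input: erased image)
            self.discriminator = discriminator  # -> score of whether the text was erased

        def forward(self, sample_image):
            final_erased = self.erasure_net(sample_image)
            text_mask = self.seg_head(final_erased)
            erased_score = self.discriminator(final_erased)
            return final_erased, text_mask, erased_score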
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8 or the method of any one of claims 9-15.
19. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8 or the method of any one of claims 9-15.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8 or the method according to any one of claims 9-15.
CN202310659387.0A 2023-06-05 2023-06-05 Image processing method, network model training method, device, equipment and medium Pending CN116863017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310659387.0A CN116863017A (en) 2023-06-05 2023-06-05 Image processing method, network model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116863017A true CN116863017A (en) 2023-10-10

Family

ID=88222384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310659387.0A Pending CN116863017A (en) 2023-06-05 2023-06-05 Image processing method, network model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116863017A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274438A (en) * 2023-11-06 2023-12-22 杭州同花顺数据开发有限公司 Picture translation method and system
CN117274438B (en) * 2023-11-06 2024-02-20 杭州同花顺数据开发有限公司 Picture translation method and system

Similar Documents

Publication Publication Date Title
CN109522816B (en) Table identification method and device and computer storage medium
CN108446698B (en) Method, device, medium and electronic equipment for detecting text in image
CN110084172B (en) Character recognition method and device and electronic equipment
US20220189083A1 (en) Training method for character generation model, character generation method, apparatus, and medium
CN114429637B (en) Document classification method, device, equipment and storage medium
CN114550177A (en) Image processing method, text recognition method and text recognition device
JP2022550195A (en) Text recognition method, device, equipment, storage medium and computer program
CN114298900A (en) Image super-resolution method and electronic equipment
CN116863017A (en) Image processing method, network model training method, device, equipment and medium
CN114218889A (en) Document processing method, document model training method, document processing device, document model training equipment and storage medium
CN113657396B (en) Training method, translation display method, device, electronic equipment and storage medium
CN114998897B (en) Method for generating sample image and training method of character recognition model
CN115376137B (en) Optical character recognition processing and text recognition model training method and device
CN111767924A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN114842482B (en) Image classification method, device, equipment and storage medium
US20230005171A1 (en) Visual positioning method, related apparatus and computer program product
CN113591861B (en) Image processing method, device, computing equipment and storage medium
CN115937039A (en) Data expansion method and device, electronic equipment and readable storage medium
CN113038184B (en) Data processing method, device, equipment and storage medium
CN111160265B (en) File conversion method and device, storage medium and electronic equipment
CN113361536A (en) Image semantic segmentation model training method, image semantic segmentation method and related device
CN114120305A (en) Training method of text classification model, and recognition method and device of text content
US20200342249A1 (en) Optical character recognition support system
CN115147850B (en) Training method of character generation model, character generation method and device thereof
CN111626283B (en) Character extraction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination