CN114842492A - Key information extraction method and device, storage medium and electronic equipment

Key information extraction method and device, storage medium and electronic equipment

Info

Publication number
CN114842492A
Authority
CN
China
Prior art keywords
image
key information
processed
initial text
text boxes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210472937.3A
Other languages
Chinese (zh)
Inventor
王少康
马志国
张飞飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dingshixing Education Consulting Co ltd
Original Assignee
Beijing Dingshixing Education Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dingshixing Education Consulting Co ltd filed Critical Beijing Dingshixing Education Consulting Co ltd
Priority to CN202210472937.3A
Publication of CN114842492A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 30/19107 Clustering techniques
    • G06V 30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 30/19173 Classification techniques
    • G06V 30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a key information extraction method and apparatus, a storage medium, and an electronic device, and belongs to the technical field of natural language processing. The method includes: extracting visual features and text semantic features of a plurality of initial text boxes in an image to be processed; obtaining node features of the initial text boxes according to the visual features and the text semantic features; clustering the node features of the initial text boxes based on a graph convolution model to obtain a category corresponding to each initial text box, the graph convolution model performing clustering by spectral-domain convolution; and determining, from the plurality of initial text boxes and according to their categories, a target text box carrying the necessary key information. With the key information extraction method provided by the disclosure, the necessary key information can be extracted from the image to be processed, so that a reviewer only needs to examine the necessary key information rather than irrelevant information, which improves review efficiency.

Description

Key information extraction method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for extracting key information, a storage medium, and an electronic device.
Background
In recent years, with the rapid development of the knowledge economy, the education market has grown strongly, which has also placed considerable working pressure on the staff of schools.
For example, when a large school recruits teachers, its recruiters must check each candidate's personal certificates, such as teacher qualification certificates and academic credentials. When there are many candidates, the number of such certificates is large, and checking them manually is time-consuming, labor-intensive, and inefficient.
Disclosure of Invention
The present disclosure provides a key information extraction method and apparatus, a storage medium, and an electronic device, so as to solve the above technical problems.
In order to achieve the above object, a first aspect of the embodiments of the present disclosure provides a method for extracting key information, where the method includes:
extracting visual features and text semantic features of a plurality of initial text boxes in an image to be processed;
obtaining node features of the plurality of initial text boxes according to the visual features and the text semantic features;
clustering the node features of the initial text boxes based on a graph convolution model to obtain a category corresponding to each initial text box, wherein the graph convolution model performs clustering by spectral-domain convolution;
and determining a target text box carrying necessary key information from the plurality of initial text boxes according to the categories corresponding to the plurality of initial text boxes.
Optionally, the extracting visual features of a plurality of initial text boxes in the image to be processed includes:
extracting visual features and text semantic features of the plurality of initial text boxes based on an HRNet model.
Optionally, before the extracting of the visual features and text semantic features of the initial text boxes in the image to be processed, the method includes:
identifying the image to be processed and acquiring a plurality of text boxes of the image to be processed;
marking the text boxes carrying key information among the plurality of text boxes to obtain initial text boxes with a first preset label;
and extracting the initial text boxes with the first preset label based on the graph convolution model.
Optionally, the graph convolution model is obtained by training through the following step:
training an initial model with Chinese, English, and a specified ethnic language as training samples to obtain the graph convolution model.
Optionally, the image to be processed is obtained by:
enhancing the image quality of an input image based on a generative adversarial network to obtain a first image;
and correcting the first image to obtain the image to be processed.
Optionally, the correcting of the first image to obtain the image to be processed includes:
removing the background region of the first image to obtain the image to be processed.
Optionally, the correcting of the first image to obtain the image to be processed includes:
fitting a plurality of vertices of the first image according to a plurality of edges of the first image to obtain the plurality of vertices of the first image;
and obtaining the image to be processed according to the plurality of vertices of the first image and the plurality of edges of the first image.
According to a second aspect of the embodiments of the present disclosure, there is provided a key information extraction apparatus, the apparatus including:
an extraction module, used for extracting visual features and text semantic features of a plurality of initial text boxes in an image to be processed;
a node feature determining module, used for obtaining node features of the plurality of initial text boxes according to the visual features and the text semantic features;
a clustering module, used for clustering the node features of the initial text boxes based on a graph convolution model to obtain a category corresponding to each initial text box, the graph convolution model performing clustering by spectral-domain convolution;
and a key information extraction module, used for determining a target text box carrying necessary key information from the plurality of initial text boxes according to the categories corresponding to the initial text boxes.
According to a third aspect proposed by an embodiment of the present disclosure, there is provided a non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of the key information extraction method provided by any one of the first aspect of the embodiments of the present disclosure.
According to a fourth aspect proposed by an embodiment of the present disclosure, there is provided an electronic apparatus including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the key information extraction method provided by any one of the first aspect of the embodiments of the present disclosure.
With the above technical solution, a target text box is obtained from the plurality of initial text boxes, so that the key information in the obtained target text box is the necessary key information, i.e., the key information needed to confirm the certificate holder and the certificate type. A reviewer can thus determine the certificate type and the holder information of the image to be processed by checking only the necessary key information, without checking information unrelated to them, which improves review efficiency.
The graph convolution model performs clustering by spectral-domain convolution, which can predict the category of the initial text box currently being predicted from any number of neighboring initial text boxes rather than from a fixed number of neighbors. The category of an initial text box can therefore be predicted from fewer neighbors, which increases prediction speed, or from more neighbors, which increases prediction accuracy.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart illustrating steps of a key information extraction method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a teacher qualification certificate according to an exemplary embodiment of the present disclosure;
FIG. 3 is a diagram illustrating initial text boxes with key information in a teacher qualification certificate according to an exemplary embodiment of the present disclosure;
FIG. 4 is a diagram illustrating target text boxes with necessary key information according to an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an HRNet network according to an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a node and a convolution kernel according to an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram of determining the vertices of an image to be processed according to an exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram of a key information extraction apparatus according to an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, and are not intended to limit the present disclosure.
It should be noted that all actions of acquiring signals, information or data in the present disclosure are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
Referring to FIG. 1, a key information extraction method is illustrated, which includes the following steps:
step S11: and extracting visual features and text semantic features of a plurality of initial text boxes in the image to be processed.
In this step, after the image to be processed is subjected to image recognition, a plurality of text boxes of the image to be processed are obtained, where the initial text box refers to an initial text box with a first preset tag in the plurality of text boxes.
Specifically, the initial text box is obtained by the following steps:
substep A1: and identifying the image to be processed, and acquiring a plurality of text boxes of the image to be processed.
The image to be processed may be a card data image such as a teacher certificate, a qualification certificate, and the like, and when the image to be processed is identified, an OCR (Optical Character Recognition) may be used to detect and identify a text line in the image to be processed, and provide a spatial position relationship of each text box in the image to be processed and a Character identification result.
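As a concrete illustration of this detection-and-recognition step, the minimal sketch below uses PaddleOCR as the OCR engine; the library choice, its parameters, and the file name are assumptions for illustration only, since the disclosure requires merely that an OCR step supply each text box's position and recognition result.

```python
# Hedged sketch: PaddleOCR is an assumed stand-in OCR engine, not the patent's own.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")          # Chinese certificate text
result = ocr.ocr("teacher_certificate.jpg", cls=True)   # detect + recognize text lines

for box, (text, confidence) in result[0]:
    # box: four (x, y) corner points, i.e. the spatial position of the text box
    # text: the character recognition result for that box
    print(box, text, confidence)
```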
Illustratively, after the teacher qualification certificate serving as the image to be processed is recognized, the spatial positions and character recognition results of text boxes such as "holder", "Zhang San", "gender", and "female" are obtained, as shown in FIG. 2.
Substep A2: marking the text boxes carrying key information among the plurality of text boxes to obtain initial text boxes with a first preset label.
The character recognition result of a text box may be key information or non-key information: key information is information capable of indicating the purpose of the image to be processed, and non-key information is information that cannot.
For example, as shown in FIG. 2, the image to be processed is a teacher qualification certificate. Character recognition results such as "Zhang San", "female", "X ethnicity", and "senior middle school teacher" indicate that the image is Zhang San's senior middle school teacher qualification certificate; these recognition results are key information. Character recognition results such as "in accordance with the provisions of XXX", "holder", "gender", "date of birth", and "ethnicity" are unrelated to indicating that the image is Zhang San's teacher qualification certificate; these recognition results are non-key information.
When marking the plurality of text boxes, the text boxes carrying key information are marked with a first preset label, and the text boxes carrying non-key information are marked with a second preset label. In this way, initial text boxes with the first preset label and non-initial text boxes with the second preset label are obtained.
Specifically, since the key information differs from one image to another, the first preset label of each text box carrying key information needs to be annotated one by one; the non-key information in each image, by contrast, is roughly the same, so the text boxes carrying non-key information can be marked with the second preset label based on a preset template.
For example, across a number of teacher qualification certificates, non-key information such as "in accordance with the provisions of XXX", "holder", "gender", and "date of birth" displays the same text on every certificate; therefore, when labeling the text boxes carrying non-key information in a teacher-qualification-certificate image, a preset template can be used.
For example, preset template A marks the non-initial text boxes of non-key information in teacher qualification certificates, preset template B marks those in architect qualification certificates, and preset template C marks those in agent qualification certificates.
Specifically, when a preset template is used for marking, the template carries a plurality of preset text boxes, each holding its own non-key information and each marked with the second preset label. Whether the non-key information on the preset template is consistent with that on the image to be processed can then be compared; if so, the second preset label of the corresponding preset text box on the template is applied to the matching text box in the image, so that the image to be processed obtains non-initial text boxes with the second preset label.
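A minimal sketch of this template-based labeling is given below; the field names in the template and the exact-match rule are illustrative assumptions rather than the disclosure's specification.

```python
# Preset template: non-key field texts shared by all teacher qualification
# certificates (illustrative entries, translated for readability)
TEMPLATE_A = {"holder", "gender", "date of birth", "ethnicity",
              "in accordance with the provisions of XXX"}

def label_boxes(ocr_boxes, template=TEMPLATE_A):
    """ocr_boxes: list of (box_coords, recognized_text) pairs from the OCR step."""
    labeled = []
    for coords, text in ocr_boxes:
        # Text matching a preset template field -> second preset label (non-key);
        # everything else is annotated as key information -> first preset label.
        label = "second_preset" if text.strip() in template else "first_preset"
        labeled.append((coords, text, label))
    return labeled
```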
Substep A3: and extracting an initial text box with the first preset label based on the graph convolution model.
After the initial text boxes with the first preset label and the non-initial text boxes with the second preset label are obtained, the non-initial text boxes with the second preset label may be screened out by a graph convolution network (GCN) model so as to extract the initial text boxes with the first preset label from the plurality of text boxes; in this way, the initial text boxes with the first preset label shown in FIG. 3 are finally screened out of the text boxes shown in FIG. 2.
Specifically, the key information in a plurality of training images and the first preset labels corresponding to the key information may be used as training samples to train an initial model; the non-key information in the training images and the second preset labels corresponding to the non-key information are then also used as training samples, and the graph convolution model is obtained.
After the trained graph convolution model receives the image to be processed, it can identify the initial text boxes corresponding to the key information according to the first preset label, and likewise the non-initial text boxes corresponding to the non-key information according to the second preset label. Once the initial and non-initial text boxes have been distinguished in the image to be processed, the initial text boxes with the first preset label are extracted.
When training the graph convolution model, not only Chinese character information is fed into the model: English or character information of a specified ethnic language (such as Tibetan) can also serve as training samples for the initial model. The resulting graph convolution model can then extract initial text boxes not only from Chinese images to be processed but also from English, Tibetan, and other such images, which improves the generality of the model.
In this step, the visual features of an initial text box include its spatial position information, and its text semantic features include the semantics of the key information. The visual features and text semantic features of the initial text boxes can be extracted through an HRNet model.
In the related art, a feature map of the image to be processed is output by a UNet model, and the text-region coordinates of each initial text box are mapped to the corresponding text region in the feature map to obtain the visual features and text semantic features of each initial text box.
Specifically, the UNet model is divided into an encoding part and a decoding part and obtains the feature map of the image to be processed as follows. The encoding part downsamples the input image, whose initial size is 224x224, four times, producing feature maps of four sizes: 112x112, 56x56, 28x28, and 14x14. The decoding part performs concatenation, convolution, and upsampling: for example, the 14x14 feature map is upsampled to 28x28, channel-concatenated with the 28x28 feature map obtained during downsampling, and the spliced result is convolved and upsampled to obtain a 56x56 feature map; the 56x56 map is concatenated, convolved, and upsampled in turn, and after four rounds of upsampling a feature map of the same size as the input image is output.
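For illustration, here is a shape-level sketch of that encode/decode flow in PyTorch; the channel widths are assumptions, since the passage specifies only the spatial sizes.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

x = torch.randn(1, 3, 224, 224)                        # input image, 224x224
e1 = block(3, 64)(x)                                   # 224x224
e2 = block(64, 128)(nn.MaxPool2d(2)(e1))               # 112x112
e3 = block(128, 256)(nn.MaxPool2d(2)(e2))              # 56x56
e4 = block(256, 512)(nn.MaxPool2d(2)(e3))              # 28x28
e5 = block(512, 512)(nn.MaxPool2d(2)(e4))              # 14x14 (fourth downsample)

up = nn.Upsample(scale_factor=2)                       # decode: upsample + concat + conv
d4 = block(1024, 256)(torch.cat([up(e5), e4], dim=1))  # 28x28, skip connection
d3 = block(512, 128)(torch.cat([up(d4), e3], dim=1))   # 56x56
d2 = block(256, 64)(torch.cat([up(d3), e2], dim=1))    # 112x112
d1 = block(128, 64)(torch.cat([up(d2), e1], dim=1))    # 224x224, same size as input
```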
In this process, the UNet model has the following drawbacks:
1. The encoding part of the UNet model acquires the context information of each text region in the image to be processed and determines the text semantic features of each region based on that context; the decoding part locates the spatial position of each text region in the image to obtain its visual features.
The more encoding and decoding stages the UNet model has, the lower the accuracy of spatial localization of the text regions; the fewer stages it has, the less context information is acquired and the lower the accuracy of the resulting text semantic features. With a UNet model, therefore, accurate visual features and accurate text semantic features cannot be obtained at the same time.
2. The downsampling-then-upsampling process of the UNet network loses spatial position information of the initial text boxes, reducing the ability to localize them.
In the present disclosure, a feature map of the image to be processed is output based on an HRNet model, and the text-region coordinates of each initial text box are mapped to the corresponding text region in the feature map, so as to obtain the visual features and text semantic features of each initial text box, where the feature maps have different resolutions and thus different visual features.
Specifically, as shown in FIG. 5, each block represents a feature map: the larger the feature map, the higher its resolution, and the smaller the feature map, the lower its resolution.
As can be seen from FIG. 5, the high-resolution subnetwork (first row, first column in FIG. 5) serves as the initial network, multi-resolution subnetworks are then connected in parallel for multi-scale fusion, and the rightmost subnetwork is finally obtained. In the last column of feature maps of the rightmost subnetwork, each feature map receives information from several parallel subnetworks in front of it, so the rightmost feature maps carry both rich high-resolution and rich low-resolution information, making the visual features and text semantic features of the text regions of the image to be processed more accurate.
For an initial text box, a high-resolution feature map carries more text semantic information of the text region, while a low-resolution feature map carries more of its spatial position information. Since the rightmost feature map obtained by the high-resolution network has both rich high-resolution and rich low-resolution features, the visual features and text semantic features obtained from the feature map of the image to be processed are more accurate.
The visual features and text semantic features of the plurality of text regions in the image to be processed are extracted through the HRNet model, so the obtained text regions carry rich high-resolution and low-resolution information; after the initial text boxes are mapped to their respective text regions, the key information in the initial text boxes also carries rich high-resolution and low-resolution information. More high-resolution information means more text semantic information for an initial text box, and more low-resolution information means richer visual features; the visual features and text semantic features obtained for the plurality of initial text boxes are therefore richer, and the initial text boxes represented by these richer features are more accurate.
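As a concrete illustration of mapping text-box coordinates onto the HRNet feature map, the sketch below uses RoI pooling; the timm model name, pooling size, and box coordinates are assumptions, since the disclosure does not name a specific implementation.

```python
import torch
import timm
from torchvision.ops import roi_align

# Hedged stand-in for the HRNet backbone (weights left untrained here)
backbone = timm.create_model("hrnet_w18", pretrained=False, features_only=True)
image = torch.randn(1, 3, 512, 512)              # preprocessed document image
fmap = backbone(image)[0]                        # highest-resolution feature map
scale = fmap.shape[-1] / image.shape[-1]         # feature-map / image size ratio

boxes = torch.tensor([[40., 60., 220., 90.],     # (x1, y1, x2, y2) per text box
                      [40., 120., 300., 150.]])
feats = roi_align(fmap, [boxes], output_size=(3, 3), spatial_scale=scale)
feats = feats.flatten(1)                         # one feature vector per text box
```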
Step S12: obtaining node features of the plurality of initial text boxes according to the visual features and the text semantic features.
In this step, the visual features and text semantic features of each initial text box are fused by a Kronecker product to obtain fused features; the fused features are the node features.
After the node features are obtained, the structural features of each initial text box are obtained through a multi-modal graph reasoning model. A structural feature is the weight of the edge between two initial text boxes, the weight representing the relative spatial distance (including the relative horizontal distance and the relative vertical distance) between the boxes.
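A minimal sketch of this step follows, under stated assumptions: the feature dimensions and the distance-to-weight mapping are illustrative, since the disclosure specifies only the Kronecker-product fusion and distance-based edge weights.

```python
import torch

def node_feature(visual, semantic):
    # Kronecker product keeps all pairwise interactions between the two modalities
    return torch.kron(visual, semantic)

def edge_weight(box_i, box_j):
    # Relative spatial distance between box centres; boxes are (x1, y1, x2, y2)
    cxi, cyi = (box_i[0] + box_i[2]) / 2, (box_i[1] + box_i[3]) / 2
    cxj, cyj = (box_j[0] + box_j[2]) / 2, (box_j[1] + box_j[3]) / 2
    return torch.exp(-torch.hypot(cxi - cxj, cyi - cyj) / 100.0)  # nearer -> heavier

v, s = torch.randn(16), torch.randn(8)
print(node_feature(v, s).shape)             # torch.Size([128]) fused node feature
b1 = torch.tensor([40., 60., 220., 90.])
b2 = torch.tensor([40., 120., 300., 150.])
print(edge_weight(b1, b2))                  # edge weight between the two boxes
```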
Step S13: clustering the node features of the plurality of initial text boxes based on a graph convolution model to obtain a category corresponding to each initial text box, wherein the graph convolution model performs clustering by spectral-domain convolution.
After the image to be processed is input into the graph convolution model, the model must not only screen the initial text boxes with the first preset label out of the plurality of text boxes, but also cluster the initial text boxes to obtain the category corresponding to each of them.
Specifically, clustering the plurality of initial text boxes considers not only their node features but also their structural features. The node features and structural features of each initial text box are therefore both taken as input to the graph convolution model, so that the model learns the relationships between the initial text boxes and then classifies each of them.
In the related art, graph convolution models classify by time-domain convolution (also called spatial convolution): the category of the current initial text box to be predicted is determined from a fixed set of its neighboring initial text boxes.
For example, as shown in FIG. 6, the node corresponding to each cell in the picture is equivalent to an initial text box in the image to be processed. When a node is convolved, the neighborhood size is fixed (for example, the 3x3 convolution kernel on the right of FIG. 6 convolves node A over its 8-neighborhood on the left of FIG. 6), and the convolution order is fixed as well: the kernel sweeps node A's 8-neighborhood from the upper-left corner to the lower-right corner.
In this process, the neighborhood of node A must first be constructed, and a fixed number of neighborhood nodes selected according to the expected probability of each node being chosen; the neighborhood nodes are then sorted by that probability expectation; finally, the sorted neighborhood nodes and node A are convolved by the kernel to obtain the category of node A.
However, this way of constructing neighborhoods relies on a fixed number of neighbors; for example, node A in FIG. 6 must use all 8 of its neighbors to obtain its category, which makes prediction slow.
In the present disclosure, the graph convolution model clusters by convolution in the spectral domain, and spectral-domain convolution clustering can use any number of neighbors to predict the category of the node (initial text box) currently being predicted.
For example, with spectral-domain convolution, the category of node A can be obtained from 8 neighbors or from only 4, and prediction is faster than with time-domain convolution.
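To make the contrast concrete, the sketch below shows a spectral graph-convolution layer in the widely used first-order (Kipf and Welling style) form; it is an illustrative stand-in, not the disclosure's exact layer. Note how every weighted neighbor of a node contributes through the normalized adjacency, with no fixed neighborhood size or ordering.

```python
import torch

def spectral_gcn_layer(X, A, W):
    # X: [N, F] node features; A: [N, N] weighted adjacency; W: [F, C] weights
    A_hat = A + torch.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    L = torch.diag(d_inv_sqrt) @ A_hat @ torch.diag(d_inv_sqrt)  # sym. normalization
    return torch.relu(L @ X @ W)                         # aggregate over ALL weighted neighbors

N, F, C = 12, 128, 5                 # text boxes, node-feature dim, categories
X = torch.randn(N, F)                # node features from Step S12
A = torch.rand(N, N)
A = (A + A.T) / 2                    # structural features (symmetric edge weights)
logits = spectral_gcn_layer(X, A, torch.randn(F, C))
categories = logits.argmax(dim=1)    # one predicted category per initial text box
```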
Step S14: determining a target text box carrying necessary key information from the plurality of initial text boxes according to the categories corresponding to the initial text boxes.
In this step, after the plurality of initial text boxes are recognized by the graph convolution model, the category of each can be obtained: for example, the category of "Zhang San" is holder, the category of "female" is gender, the category of "X ethnicity" is ethnicity, and so on.
Among the obtained categories, unnecessary key information may still exist. Unnecessary key information is duplicated information, or information irrelevant to confirming the holder information and the certificate type; necessary key information is information relevant to confirming the holder and the certificate type.
For example, referring to FIG. 3, the initial text box "senior middle school teacher qualification" appears on both the left and right sides of the image to be processed, and only one instance is needed for a user to confirm the certificate type. The sharpness of the key information displayed in the two initial text boxes is therefore compared: as can be seen from FIG. 2, the left "senior middle school teacher qualification" box is less sharp than the right one, so the less sharp left box is screened out and the sharper right box is obtained as the target text box.
Sharpness refers to how clearly the key information within an initial text box is displayed. When initial text boxes carrying the same key information differ in sharpness, the key information in the less sharp box is taken as unnecessary key information and that in the sharper box as necessary key information; when the sharpness is the same, either of the two initial text boxes may be selected as the target text box.
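The disclosure does not state how sharpness is measured; the sketch below uses the variance of the Laplacian, a common blur-detection proxy, purely as an assumed stand-in.

```python
import cv2
import numpy as np

def sharpness(crop_bgr):
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()   # higher variance = sharper edges

left = np.random.randint(0, 255, (60, 240, 3), dtype=np.uint8)  # stand-in crops
right = cv2.GaussianBlur(left, (9, 9), 0)
# Keep the sharper instance of the duplicated key information as the target box
target = left if sharpness(left) >= sharpness(right) else right
```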
For another example, as shown in FIG. 3, the initial text box of the "accreditation institution official seal" on the right is unrelated to the holder information and the certificate type; it can therefore be determined to be non-key information and screened out of the plurality of initial text boxes.
By removing the duplicated information and the information unrelated to confirming the holder information and the certificate type, the redundant initial text boxes such as "Zhang San", "senior middle school teacher qualification", "certification institution", and "6/1/2002" in FIG. 3 can be removed, and the plurality of target text boxes shown in FIG. 4 are obtained.
According to the key information extraction method provided by the disclosure: in the first aspect, based on the HRNet model, the obtained feature map carries rich high-resolution and low-resolution information, yielding rich visual features and text semantic features; with more such features, the initial text boxes can be expressed more accurately, their categories determined accurately according to the accurate initial text boxes, and the problem of inaccurate category prediction caused by inaccurate initial text boxes is solved at the source. In the second aspect, the graph convolution model clusters by spectral-domain convolution; since spectral-domain convolution can predict the category of the current initial text box from any number of neighboring initial text boxes rather than from a fixed number of neighbors, the category can be predicted from fewer neighbors, which increases prediction speed, or from more neighbors, which increases prediction accuracy. In the third aspect, the non-initial text boxes are first screened out of the plurality of text boxes, so that the text information in the remaining initial text boxes is key information; the target text boxes are then obtained from the initial text boxes, so that the key information in the target text boxes is the necessary key information; finally, only the necessary key information relevant to confirming the holder and the certificate type is presented, so that a reviewer need only check the necessary key information rather than useless information, which greatly improves review efficiency.
In a possible embodiment, the quality of certificate images photographed by users is uneven, which may interfere with image recognition. Therefore, before recognition, the photographed certificate image needs to be further processed to obtain the image to be processed.
Specifically, the method comprises the following steps:
step S21: and enhancing the image quality of the input image based on the generative countermeasure network to obtain a first image.
In this step, through a generative confrontation network (GAN network), the overall features or local features of the image can be purposefully enhanced, the originally unclear card image is converted into a clear image to be processed, or the features of interest in the card image are enhanced, the features of no interest are suppressed, the difference between different features in the card image is enlarged, and thus the visual effect of the image is improved.
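The sketch below shows only the data flow of this step, with an untrained toy residual generator standing in for the trained GAN generator; the architecture is an assumption, as the disclosure does not specify one.

```python
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    """Toy stand-in for the GAN generator G used for quality enhancement."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, x):
        return (x + self.body(x)).clamp(0, 1)   # predict a residual correction

raw = torch.rand(1, 3, 512, 512)    # photographed certificate, normalised to [0, 1]
with torch.no_grad():
    first_image = Enhancer()(raw)   # the "first image" passed on to the correction step
```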
Step S22: correcting the first image to obtain the image to be processed.
In this step, the correction includes removing the background region of the first image to obtain the image to be processed, as well as fitting a plurality of vertices of the first image according to its edges and obtaining the image to be processed from those vertices and edges.
The first image may be a certificate image such as a teacher qualification certificate or an agent qualification certificate.
The photographed certificate images vary in shape: depending on the user's shooting angle, a certificate may appear as a parallelogram or a trapezoid, and the image may also be tilted or rotated. For a parallelogram certificate image, a global perspective transformation can be used to remove its background; for a trapezoidal certificate image, an affine transformation can be used to remove its background; a tilted or rotated certificate image can likewise be processed by affine transformation to obtain a correctly oriented image.
Through global perspective transformation and affine transformation, the backgrounds of certificate images of different angles and shapes can be removed to obtain the image to be processed, preventing the background of the user's photograph from interfering with image recognition and improving recognition accuracy.
When two certificates overlap, another certificate image may lie beneath the one photographed by the user. As shown in FIG. 7, the upper certificate image is the image to be processed and the lower one is an image the user captured inadvertently; to keep the lower image from interfering with the upper one, vertex fitting of the certificate image is performed through global perspective transformation to obtain the image to be processed.
Specifically, the four edges of the certificate image may be extended, the four points where the edges intersect (e.g., the four vertices A, B, C, and D shown in FIG. 7) taken as the four vertices of the certificate image, the image to be processed then obtained from these four vertices and four edges, and the remaining image content outside the image to be processed deleted, removing the interference from the image below. A sketch of this procedure follows.
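This is a minimal sketch under stated assumptions: the line-intersection formula, the example coordinates, and the 600x400 output size are illustrative; the disclosure requires only extending the four edges, taking their intersections A, B, C, D as vertices, and recovering the image from them.

```python
import cv2
import numpy as np

def line_intersection(p1, p2, p3, p4):
    # Intersection of the infinite lines through (p1, p2) and (p3, p4)
    a1, b1 = p2[1] - p1[1], p1[0] - p2[0]
    a2, b2 = p4[1] - p3[1], p3[0] - p4[0]
    c1, c2 = a1 * p1[0] + b1 * p1[1], a2 * p3[0] + b2 * p3[1]
    det = a1 * b2 - a2 * b1
    return [(b2 * c1 - b1 * c2) / det, (a1 * c2 - a2 * c1) / det]

# e.g. vertex A from the extended top and left edges of the certificate
A = line_intersection([35, 40], [610, 25], [35, 40], [20, 435])   # -> [35.0, 40.0]

first_image = np.zeros((480, 680, 3), dtype=np.uint8)             # stand-in photo
src = np.float32([[35, 40], [610, 25], [640, 420], [20, 435]])    # fitted A, B, C, D
dst = np.float32([[0, 0], [600, 0], [600, 400], [0, 400]])
M = cv2.getPerspectiveTransform(src, dst)
image_to_process = cv2.warpPerspective(first_image, M, (600, 400))
```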
Alternatively, the four edges of the certificate image are extended, the four points where the edges intersect are taken as the four vertices, and the image to be processed is obtained by fitting the four vertices through an image model.
For example, sets of four vertices and the images corresponding to them may be input into the image model for training, so that after the image model receives four vertices it can fit and output the image to be processed.
Removing the background of the first image prevents the background from interfering with the image; correcting the first image prevents an image underneath it from interfering with it, avoiding the erroneous extraction of key information that would be caused by recognizing the underlying image during image recognition.
Referring to FIG. 8, an exemplary embodiment of the disclosure shows a block diagram of a key information extraction apparatus 1300. The apparatus 1300 includes an extraction module 1301, a node feature determining module 1302, a clustering module 1303, and a key information extraction module 1304.
The extraction module 1301 is used for extracting visual features and text semantic features of a plurality of initial text boxes in an image to be processed;
the node feature determining module 1302 is used for obtaining node features of the plurality of initial text boxes according to the visual features and the text semantic features;
the clustering module 1303 is used for clustering the node features of the initial text boxes based on a graph convolution model to obtain a category corresponding to each initial text box, the graph convolution model performing clustering by spectral-domain convolution;
and the key information extraction module 1304 is used for determining a target text box carrying necessary key information from the plurality of initial text boxes according to the categories corresponding to the initial text boxes.
Optionally, the extraction module 1301 includes:
a first extraction module, used for extracting the visual features and text semantic features of the plurality of initial text boxes based on an HRNet model.
Optionally, the apparatus 1300 further comprises:
an identification module, used for identifying the image to be processed and acquiring a plurality of text boxes of the image to be processed;
a marking module, used for marking the text boxes carrying key information among the plurality of text boxes to obtain initial text boxes with a first preset label;
and a second extraction module, used for extracting the initial text boxes with the first preset label based on the graph convolution model.
Optionally, the apparatus 1300 further comprises:
a training module, used for training an initial model with Chinese, English, and a specified ethnic language as training samples to obtain the graph convolution model.
Optionally, the apparatus 1300 further comprises:
a first image determining module, used for enhancing the image quality of the input image based on a generative adversarial network to obtain a first image;
and a correction module, used for correcting the first image to obtain the image to be processed.
Optionally, the correction module comprises:
a first correction processing module, used for removing the background region of the first image to obtain the image to be processed.
Optionally, the correction module further comprises:
a fitting module, used for fitting a plurality of vertices of the first image according to a plurality of edges of the first image to obtain the plurality of vertices of the first image;
and a second correction processing module, used for obtaining the image to be processed according to the plurality of vertices of the first image and the plurality of edges of the first image.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the key information extraction method provided by the present disclosure.
The present disclosure also provides an electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the key information extraction method provided by the present disclosure.
Fig. 9 is a block diagram illustrating an electronic device 1900 in accordance with an example embodiment. For example, the electronic device 1900 may be provided as a server. Referring to fig. 9, an electronic device 1900 includes a processor 1922, which may be one or more in number, and a memory 1932 for storing computer programs executable by the processor 1922. The computer program stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processor 1922 may be configured to execute the computer program to perform the key information extraction method described above.
Additionally, the electronic device 1900 may also include a power component 1926 and a communication component 1950. The power component 1926 may be configured to perform power management of the electronic device 1900, and the communication component 1950 may be configured to enable communication, e.g., wired or wireless communication, of the electronic device 1900. In addition, the electronic device 1900 may also include an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, and the like.
In another exemplary embodiment, there is also provided a computer readable storage medium including program instructions which, when executed by a processor, implement the steps of the key information extraction method described above. For example, the non-transitory computer readable storage medium may be the memory 1932 that includes program instructions executable by the processor 1922 of the electronic device 1900 to perform the key information extraction method described above.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned key information extraction method when executed by the programmable apparatus.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that the various technical features described in the above embodiments may be combined in any suitable manner without contradiction, and the disclosure does not separately describe various possible combinations in order to avoid unnecessary repetition.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (10)

1. A method for extracting key information, the method comprising:
extracting visual features and text semantic features of a plurality of initial text boxes in an image to be processed;
obtaining node features of the plurality of initial text boxes according to the visual features and the text semantic features;
clustering the node features of the initial text boxes based on a graph convolution model to obtain a category corresponding to each initial text box, wherein the graph convolution model performs clustering by spectral-domain convolution;
and determining a target text box carrying necessary key information from the plurality of initial text boxes according to the categories corresponding to the plurality of initial text boxes.
2. A key information extraction method as claimed in claim 1, wherein said extracting visual features of a plurality of initial text boxes in the image to be processed comprises:
extracting visual features and text semantic features of the plurality of initial text boxes based on an HRNet model.
3. A key information extraction method as claimed in claim 1, wherein before extracting the visual features and text semantic features of the initial text boxes in the image to be processed, the method comprises:
identifying the image to be processed and acquiring a plurality of text boxes of the image to be processed;
marking the text boxes carrying key information among the plurality of text boxes to obtain initial text boxes with a first preset label;
and extracting the initial text boxes with the first preset label based on the graph convolution model.
4. A key information extraction method as claimed in claim 1, wherein the graph convolution model is obtained by training through the following step:
training an initial model with Chinese, English, and a specified ethnic language as training samples to obtain the graph convolution model.
5. A key information extraction method according to claim 1, wherein the image to be processed is obtained by:
enhancing the image quality of the input image based on a generative adversarial network to obtain a first image;
and correcting the first image to obtain the image to be processed.
6. A key information extraction method as claimed in claim 5, wherein the correcting the first image to obtain the image to be processed comprises:
removing the background region of the first image to obtain the image to be processed.
7. A key information extraction method as claimed in claim 5, wherein the step of correcting the first image to obtain the image to be processed comprises:
fitting a plurality of vertices of the first image according to a plurality of edges of the first image to obtain the plurality of vertices of the first image;
and obtaining the image to be processed according to the plurality of vertices of the first image and the plurality of edges of the first image.
8. A key information extraction apparatus, characterized in that the apparatus comprises:
an extraction module, used for extracting visual features and text semantic features of a plurality of initial text boxes in an image to be processed;
a node feature determining module, used for obtaining node features of the plurality of initial text boxes according to the visual features and the text semantic features;
a clustering module, used for clustering the node features of the initial text boxes based on a graph convolution model to obtain a category corresponding to each initial text box, the graph convolution model performing clustering by spectral-domain convolution;
and a key information extraction module, used for determining a target text box carrying necessary key information from the plurality of initial text boxes according to the categories corresponding to the initial text boxes.
9. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.
CN202210472937.3A 2022-04-29 2022-04-29 Key information extraction method and device, storage medium and electronic equipment Pending CN114842492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210472937.3A CN114842492A (en) 2022-04-29 2022-04-29 Key information extraction method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210472937.3A CN114842492A (en) 2022-04-29 2022-04-29 Key information extraction method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114842492A true CN114842492A (en) 2022-08-02

Family

ID=82568148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210472937.3A Pending CN114842492A (en) 2022-04-29 2022-04-29 Key information extraction method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114842492A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640401A (en) * 2022-12-07 2023-01-24 恒生电子股份有限公司 Text content extraction method and device
CN115640401B (en) * 2022-12-07 2023-04-07 恒生电子股份有限公司 Text content extraction method and device

Similar Documents

Publication Publication Date Title
US10762376B2 (en) Method and apparatus for detecting text
CN110705583B (en) Cell detection model training method, device, computer equipment and storage medium
US11328401B2 (en) Stationary object detecting method, apparatus and electronic device
TWI747120B (en) Method, device and electronic equipment for depth model training and storage medium thereof
CN112487848B (en) Character recognition method and terminal equipment
CN110569856B (en) Sample labeling method and device, and damage category identification method and device
CN109919903B (en) Spine detection positioning marking method and system and electronic equipment
CN112966548B (en) Soybean plot identification method and system
CN112037077A (en) Seal identification method, device, equipment and storage medium based on artificial intelligence
CN109377494B (en) Semantic segmentation method and device for image
CN111553363B (en) End-to-end seal identification method and system
CN111008576A (en) Pedestrian detection and model training and updating method, device and readable storage medium thereof
CN109919149A (en) Object mask method and relevant device based on object detection model
CN114429636B (en) Image scanning identification method and device and electronic equipment
CN111126486A (en) Test statistical method, device, equipment and storage medium
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
CN115546809A (en) Table structure identification method based on cell constraint and application thereof
CN114842492A (en) Key information extraction method and device, storage medium and electronic equipment
CN114820679A (en) Image annotation method and device, electronic equipment and storage medium
CN111341438A (en) Image processing apparatus, electronic device, and medium
CN112308062B (en) Medical image access number identification method in complex background image
CN116246161A (en) Method and device for identifying target fine type of remote sensing image under guidance of domain knowledge
CN112750124B (en) Model generation method, image segmentation method, model generation device, image segmentation device, electronic equipment and storage medium
CN111950356B (en) Seal text positioning method and device and electronic equipment
CN111401347A (en) Information positioning method and device based on picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination