CN112801099A - Image processing method, device, terminal equipment and medium - Google Patents

Image processing method, device, terminal equipment and medium

Info

Publication number
CN112801099A
Authority
CN
China
Prior art keywords
key
value
field
text sequence
fields
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010490243.3A
Other languages
Chinese (zh)
Other versions
CN112801099B (en)
Inventor
曹浩宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010490243.3A
Publication of CN112801099A
Application granted
Publication of CN112801099B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses an image processing method, an image processing device, terminal equipment and a medium. In the method, an image to be processed can be converted into a text sequence, and the key fields and value fields included in the text sequence are determined. The key fields and the value fields are combined pairwise to obtain at least one key value text sequence, characteristic information of the key field and the value field in each key value text sequence is obtained, the key field and the value field in each key value text sequence are paired according to the characteristic information, and the structured text corresponding to the image to be processed is output based on the pairing result of the key field and the value field in each key value text sequence. By converting image data into structured data, more valuable reference data can be provided for a user, and the practicability and intelligence of the image processing scheme are improved.

Description

Image processing method, device, terminal equipment and medium
Technical Field
The present application relates to the field of internet technologies, in particular to the field of computer technologies, and more particularly to an image processing method, an image processing apparatus, a terminal device, and a computer storage medium.
Background
With the rapid development of the mobile internet, image character recognition technology is increasingly widely applied. A typical image character recognition technology is Optical Character Recognition (OCR), which electronically scans an input image and extracts character information from it, so that the burden on the user of manually inputting the corresponding character information is reduced, the user can conveniently store and edit that information, and a large amount of human resources can be saved. However, OCR only produces an editable character string, which is of limited value to the user, so the practicability of the image character recognition technology is limited.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, terminal equipment and a medium, and by converting image data into structured data, more valuable reference data can be provided for a user, so that the practicability and intelligence of an image processing scheme are improved.
In one aspect, an embodiment of the present application provides an image processing method, where the method includes:
converting an image to be processed into a text sequence;
performing key value classification on the text sequence, and determining a key field and a value field included in the text sequence based on a key value classification result;
combining the key fields and the value fields pairwise to obtain at least one key value text sequence, wherein each key value text sequence comprises one key field and one value field;
acquiring characteristic information of a key field and a value field in each key value text sequence;
pairing the key fields and the value fields in each key value text sequence according to the characteristic information;
and outputting a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
In another aspect, an embodiment of the present application provides an image processing apparatus, including:
the conversion unit is used for converting the image to be processed into a text sequence;
the processing unit is used for carrying out key value classification on the text sequences, determining key fields and value fields included in the text sequences based on key value classification results, combining the key fields and the value fields in pairs to obtain at least one key value text sequence, obtaining characteristic information of the key fields and the value fields in each key value text sequence, and carrying out pairing processing on the key fields and the value fields in each key value text sequence according to the characteristic information;
and the output unit is used for outputting the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
Correspondingly, the embodiment of the application also provides a terminal device, which comprises an output device, a processor and a storage device; the storage device is used for storing program instructions; and the processor is used for calling the program instructions and executing the image processing method described above.
Accordingly, the embodiment of the present application also provides a computer storage medium, in which program instructions are stored, and when the program instructions are executed, the computer storage medium is used for implementing the image processing method.
In the embodiment of the application, the image to be processed can be converted into a text sequence, key value classification is performed on the text sequence, and the key field and the value field included in the text sequence are determined based on the key value classification result. Further, the key fields and the value fields can be combined pairwise to obtain at least one key value text sequence, characteristic information of the key field and the value field in each key value text sequence is obtained, the key field and the value field in each key value text sequence are paired according to the characteristic information, and then the structured text corresponding to the image to be processed is output based on the pairing result of the key field and the value field in each key value text sequence. By converting image data into structured data, more valuable reference data can be provided for a user, and the practicability and intelligence of the image processing scheme are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative work.
Fig. 1 is an application scene diagram of an image processing method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a named entity model provided by an embodiment of the present application;
fig. 4 is a scene illustration diagram of key value classification by a named entity model according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of key value pairing according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of another image processing method provided in the embodiments of the present application;
fig. 7a is an application scene diagram of another image processing method provided in the embodiment of the present application;
FIG. 7b is a diagram of an application scenario of another image processing method provided in an embodiment of the present application;
fig. 7c is an application scene diagram of another image processing method provided in the embodiment of the present application;
fig. 8 is a schematic view of a scenario of a feature extraction and pairing process provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a scenario for determining an aspect ratio of a key field and a value field in an image to be processed according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
AI (Artificial Intelligence) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is an integrated technique in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer vision is a science that studies how to make machines "see"; it uses cameras and computers instead of human eyes to identify, track and measure targets, and performs further image processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
OCR is an image character recognition technology in computer vision. It electronically scans an input image and extracts characters from it, which can reduce the burden on the user of inputting the corresponding character information, makes it convenient for the user to store and edit that information, and helps save a large amount of human resources. However, the result of OCR recognition is only a string of editable characters that does not contain any structured information and is therefore of limited value; what is really valuable to the user is structured data. For example, in business license recognition, a user needs recognition results for important fields such as the enterprise name and the legal representative rather than a simple character recognition result. Therefore, how to convert image data into structured data has become an important research direction. Structured data here can be understood as a structured result of key-value pairs.
For example, referring to fig. 1, assuming that the image to be processed is a license image as shown in the left image of fig. 1, the structured data corresponding to the license image may be as shown in the right image of fig. 1, and the key value pairs included in the structured data are respectively as shown in table 1.
Table 1 (the key-value pairs contained in the structured data; presented as an image in the original publication)
Currently, image data can be converted into structured data by OCR structured methods, which can typically include image feature-based template registration methods and text feature-based custom field detection methods.
The template registration method based on the image features can map an image to be structured to a template image according to anchor point features (such as fixed characters, field distribution and the like) in the template image, and extract a structured result of a corresponding field according to position information, so that conversion from image data to structured data is realized. The method has the following disadvantages:
1. the requirements on image quality and character recognition results are high, and problems such as rotation, perspective and distortion are difficult to deal with;
2. the method is applicable only to OCR structured scenes with fixed-format images. For example, if the template image used by the template registration method is a resident identity card image, the method can only process images to be processed that have the same format as the resident identity card image. Identity document images in other formats, such as passports and driving licenses, or images of other business types, such as social security cards, business licenses and value-added tax invoices, cannot be detected, or are detected with extremely low accuracy.
The customized detection method based on the text features can detect the position of a field to be structured through a special text field detector, and then obtain a field identification result through a text recognizer, so that the conversion from image data to structured data is realized. The method has the following disadvantages: the method is only suitable for the OCR structured scene of the fixed format image, and in the OCR structured scene of the non-fixed format image, because the positions of the fields to be structured in the images of different formats are different, the positions of the fields to be structured in the images of different formats can not be accurately detected by a special text field detector, and the accuracy of structured data extraction is further influenced.
Therefore, the existing OCR structuring methods cannot be applied to structured scenes with non-fixed-format images, and their application range is limited. Based on this, the embodiment of the present application provides an image processing method, which may be executed by a terminal device or a server; the terminal device may access an image processing platform or run an application corresponding to the image processing platform, and the server may be a server corresponding to the image processing platform. The terminal device here may be any of the following: a portable device such as a smart phone, a tablet or a laptop, or a desktop computer, etc. Correspondingly, the server may be a server providing a corresponding service for the image processing platform, and may be an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers.
In the embodiment of the application, a user can acquire an image to be processed (for example, start a camera device to shoot an image) or upload the image to be processed through an image processing platform, and trigger text recognition of the image to be processed. In this case, the terminal device or the server may convert the image to be processed into a text sequence, perform key value classification on the text sequence, and determine a key field and a value field included in the text sequence based on a result of the key value classification. Further, the key fields and the value fields can be combined pairwise to obtain at least one key value text sequence, characteristic information of the key fields and the value fields in each key value text sequence is obtained, the key fields and the value fields in each key value text sequence are paired according to the characteristic information, and then the structured text corresponding to the image to be processed is output based on the pairing result of the key fields and the value fields in each key value text sequence, wherein each key value text sequence comprises one key field and one value field. By converting the image data into the structured data, more valuable reference data can be provided for a user, and the practicability and intelligence of the image processing scheme are improved.
It can be seen that, in the process of converting image data into structured data, the embodiment of the application does not depend on a template image or a special text field detector, the accuracy of the output result is not affected by the change of the format of the image to be processed, the corresponding structured data can be more accurately extracted from the image with the non-fixed format, the method is suitable for the structured scenes of various images with the non-fixed format, and compared with the existing OCR structured method, the application range is wider.
In addition, in the pairing processing process of the key fields and the value fields in the text sequence, each key field in the text sequence and all the value fields are combined pairwise to obtain at least one key value text sequence, the key value text sequence is taken as a processing unit subsequently, each time of obtaining and pairing processing of the feature information can be realized, the targeted objects are only the key fields and the value fields in the key value text sequence, interference of other field information does not exist, the accuracy of the pairing result is improved, the accuracy of the subsequent structured text output based on the pairing processing result is improved, and the accuracy of extracting the corresponding structured data from the image to be processed is improved.
It is understood that an image with a fixed format can be understood as an image in a single format; images with non-fixed formats can be understood as images in a plurality of different formats. For example, the image processing method provided by the embodiment of the application can accurately extract the corresponding structured text from image data corresponding to a resident identity card image, can also accurately extract the corresponding structured text from image data corresponding to a passport image, and can be applied to OCR structured scenes with images in various formats. The image to be processed in the embodiment of the present application may include any one of the following: a business license image, a value-added tax invoice image, an identity card image or a social security card image, which is not specifically limited.
In an embodiment, the process of converting the image to be processed into the text sequence may be implemented based on an OCR method, and the key value classification and the pairing process may be implemented based on an NLP method, and based on this, another image processing method is proposed in the embodiment of the present application, which may be executed by the aforementioned terminal device or server, please refer to fig. 2, and the image processing method may include the following steps S201 to S204:
s201: image input, OCR recognition and layout processing. A user can acquire an image to be processed or upload the image to be processed through the image processing platform and trigger text recognition of the image to be processed. In this case, the terminal device or the server may identify the image to be processed input to the image processing platform through OCR, and perform layout processing on the identification result to obtain the text sequence. The specific implementation of the typesetting processing can be as follows: and splicing the discrete characters included in the recognition result to form paragraph text.
S202: and (4) key value classification. In a specific implementation, the text sequence may be subjected to key value classification to obtain a key value classification result, where the key value classification result includes classification tags of each character in the text sequence, each classification tag is used to indicate a character type of a corresponding character and a position of the character in a field to which the character belongs, and the position includes any one or more of the following: a start position, a middle position, and an end position, the character type including any one or more of: key characters, value characters, and other characters.
The key value classification can be performed by calling a named entity model or a position-based single-character classification model. In a specific implementation, before the named entity model or the position-based single-character classification model is called, it may be trained in advance with a large number of text sequences labeled with classification labels; the text sequence obtained after step S201 is executed may then be input into the trained named entity model or position-based single-character classification model, which outputs a key value classification result including the classification label of each character in the text sequence.
The position-based single-character classification model can be a model combining CUTIE (Convolutional Universal Text Information Extractor) and a classifier, where CUTIE is used for extracting characteristic information of each character in the text sequence and inputting the characteristic information into the classifier, and the classifier is used for classifying each character according to its characteristic information and determining the classification label of each character. The named entity model can be a model combining a Bi-directional Long Short-Term Memory (Bi-LSTM) network and a Conditional Random Field (CRF); a network structure diagram of the model is shown in fig. 3 and includes a text input module, a feature extraction module, a semantic model and a key value classification module. The text input module is used for inputting the text sequence; the feature extraction module is used for extracting the vector features of each character in the text sequence and inputting the vector features of each character into the semantic model; the semantic model is used for outputting the probabilities [p1, p2, ..., pi] that the vector of each character belongs to the respective classification labels, which can also be understood as the probabilities that each character belongs to the respective classification labels; and the key value classification module is used for determining, based on the probabilities output by the semantic model, the classification label with the highest probability as the target classification label of each character, where i is an integer larger than 0. For example, if the probability that a certain character in the text sequence belongs to classification label 1 is the highest, classification label 1 may be determined as the target classification label of that character; if the probability that another character belongs to classification label 2 is the highest, classification label 2 may be determined as the target classification label of that character. The classification labels may be as shown in table 2. It can be seen that the named entity model provided by the embodiment of the application differs from traditional named entity recognition methods: on the one hand, the output label results are only Key and Value, so the convergence and the algorithm effect are better; on the other hand, the method does not depend on the layout information or the image characteristics of the text to be recognized, and thus has a wider application scope.
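For illustration, a minimal PyTorch sketch of such a character-level tagger (embedding, Bi-LSTM, per-character tag scores) is given below; the tag set, dimensions and greedy decoding are assumptions, and in the model of fig. 3 a CRF layer would decode the final label sequence instead of a simple argmax.

```python
import torch
import torch.nn as nn

TAGS = ["B-Key", "I-Key", "E-Key", "B-Value", "I-Value", "E-Value", "O"]  # assumed tag set

class BiLstmTagger(nn.Module):
    """Character-level tagger: embedding -> Bi-LSTM -> per-character tag scores.

    In the model of fig. 3 a CRF layer would consume these emission scores and
    decode the most likely tag sequence; here we simply take the argmax.
    """
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, len(TAGS))

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.emb(char_ids))      # (batch, seq_len, 2*hidden)
        return self.out(h)                        # emission scores per character

# Usage sketch: greedy decoding without a CRF (for illustration only)
model = BiLstmTagger(vocab_size=6000)
char_ids = torch.randint(0, 6000, (1, 9))         # e.g. the 9 characters of "established date 2019"
tags = [TAGS[i] for i in model(char_ids).argmax(-1)[0].tolist()]
```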
Table 2 (the classification labels, such as "B-Key", "I-Key", "E-Key", "B-Value", "I-Value" and "E-Value"; presented as an image in the original publication)
Illustratively, referring to fig. 4, when the content of the text sequence is "established date 2019" (a nine-character sequence), the named entity model shown in fig. 3 is called for key value classification, and the named entity model can output the classification labels of the nine characters, which are respectively: "B-Key", "I-Key", "I-Key", "E-Key", "B-Value", "I-Value", "I-Value", "I-Value" and "E-Value".
Further, after obtaining a key value classification result including the classification label of each character in the text sequence, the key field and the value field included in the text sequence may be determined based on the key value classification result. Specifically, as indicated by the classification labels of the respective characters, characters in the text sequence whose character type is the key character and which belong to the same field may be integrated into a key field, and characters whose character type is the value character and which belong to the same field may be integrated into a value field. Exemplarily, in conjunction with fig. 4 and table 2, the classification labels of the characters in the text sequence "established date 2019" indicate that the four characters of "established date" are all key characters belonging to the same field, so they can be integrated into the key field "established date"; accordingly, the characters "2", "0", "1", "9" and "year" are all value characters belonging to the same field, and they can be integrated into the value field "2019".
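For illustration, a small Python sketch of merging per-character labels into key fields and value fields is given below; the list-of-characters input format is an assumption, and the nine-character example is assumed to be the original-language form of "established date 2019".

```python
from typing import List, Tuple

def merge_fields(chars: List[str], labels: List[str]) -> Tuple[List[str], List[str]]:
    """Integrate consecutive characters of the same type into key / value fields.

    A field starts at a 'B-*' label and ends at the matching 'E-*' label;
    characters labeled 'O' are ignored.
    """
    key_fields, value_fields, buf, cur = [], [], "", None
    for ch, lab in zip(chars, labels):
        if lab == "O":
            buf, cur = "", None
            continue
        pos, kind = lab.split("-")          # e.g. "B", "Key"
        if pos == "B" or kind != cur:       # start of a new field
            buf, cur = ch, kind
        else:
            buf += ch
        if pos == "E":                      # field is complete
            (key_fields if cur == "Key" else value_fields).append(buf)
            buf, cur = "", None
    return key_fields, value_fields

# Example from fig. 4 (original-language characters assumed): "established date 2019"
chars = list("成立日期2019年")
labels = ["B-Key", "I-Key", "I-Key", "E-Key",
          "B-Value", "I-Value", "I-Value", "I-Value", "E-Value"]
print(merge_fields(chars, labels))   # (['成立日期'], ['2019年'])
```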
S203: feature extraction and key value pairing. As a possible implementation manner, referring to fig. 5, after determining the key fields and the value fields in the text sequence, the terminal device or the server may extract feature information of each key field and value field in the text sequence, perform pairing processing on each key field and value field based on the feature information of each key field and value field, and determine a relationship pair category to which each key field and each value field belong in pairs, where the relationship pair category includes a key-value pair category or other categories. Further, based on the relationship pair category to which each key field and each value field belong in pairs, a pairing result for the key fields and the value fields in the text sequence is output, and the pairing result indicates the relationship pair category to which each key field and each value field belong in pairs in the text sequence.
The specific way of pairing and processing each key field and value field based on the characteristic information of each key field and value field may be: and calling a matching model to analyze the characteristic information of each key field and value field, and determining the pairing result of each key field and value field in the text sequence. The characteristic information herein may include any one or more of the following: semantic information, location information, and image information. The position information may be position information (e.g., position coordinates or row and column information) of each key field and value field in the image to be processed, and the image information may be image information of an image area where each key field and value field is located in the image to be processed, such as an image RGB value, a gray scale value, a pixel value, and the like.
In specific implementation, semantic information of each key field and each value field can be extracted through an NLP model (such as a semantic representation model Bert, a transform) and the like; the position information of each key field and value field in the image to be processed can be determined through a position information extraction model; the image information of the respective key fields and value fields in the image to be processed can be determined by an image information extraction model. The position information extraction model and the image information extraction model may both be CNN (Convolutional Neural Network), and the CNN may be trained by different training samples, so as to obtain the position information extraction model and the image information extraction model. Specifically, the training sample corresponding to the position information extraction model comprises a sample field and a sample image marked with the position information of the sample field; the training sample corresponding to the image information extraction model comprises a sample field and a sample image marked with the image information of the image area where the sample field is located.
In one embodiment, after determining the key fields and the value fields included in the text sequence, each key field and each value field in the text sequence may be input into the NLP model, and semantic information of each key field and each value field may be extracted through the NLP model. Inputting each key field and value field in the image and text sequence to be processed into a position information extraction model and an image information extraction model obtained by training, determining the position information of each key field and value field in the image to be processed through the position information extraction model, and determining the image information of the image area where each key field and value field are located in the image to be processed through the image information extraction model.
It can be understood that, when the feature information includes semantic information, position information, and image information, the extraction processes of the semantic information, the position information, and the image information are three independent processes, and there is no sequential order in execution, and the extraction processes may be performed in parallel, which is not specifically limited in this application.
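For illustration, a small sketch of assembling the feature information of one key value text sequence from the three independent extractors is given below; the extractor interfaces, the stub extractors and the simple concatenation are assumptions.

```python
import numpy as np

def build_pair_features(key_field, value_field, image,
                        semantic_model, position_model, image_model) -> np.ndarray:
    """Concatenate the three kinds of feature information for one key value text sequence.

    The three extractions are independent of each other and could run in parallel;
    each *_model here is an assumed callable returning a 1-D feature vector.
    """
    sem = semantic_model(key_field, value_field)            # semantic information (e.g. Bert)
    pos = position_model(key_field, value_field, image)     # position information (e.g. CNN)
    img = image_model(key_field, value_field, image)        # image information (e.g. CNN)
    return np.concatenate([sem, pos, img])

# Usage with stub extractors (assumptions, for illustration only):
stub = lambda *args: np.zeros(4)
features = build_pair_features("established date", "2019", None, stub, stub, stub)
```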
The matching model may be a classification model (e.g., random forest, linear regression, logistic regression, decision tree, SVM (Support Vector Machine), neural network, etc.) or a graph model (e.g., GCN (Graph Convolutional Network), etc.).
Taking the classification model as an example, after determining the feature information of all key fields and value fields included in the text sequence, the feature information of each key field and the feature information of all value fields may be merged and input into the classification model one by one, the classification model may determine whether each key field and each value field are a relationship pair, and output a pairing result, where the pairing result indicates the category of the relationship pair to which the key field and the value field belong.
For example, assuming that all key fields and value fields included in the text sequence are respectively a key field 1, a key field 2, a key field 3, a value field 1, a value field 2, and a value field 3, in this case, after determining the feature information of all key fields and value fields included in the text sequence, the feature information of the key field 1 and the feature information of all value fields may be firstly merged and input into a classification model, and it is determined through the classification model whether the feature information of the key field 1 and each value field are a relationship pair, if it is determined that the key field 1 and the value field 2 are a relationship pair, a pairing result may be output, and the pairing result indicates that the relationship pair category to which the key field 1 and the value field 2 belong is a key-value pair category. And by analogy, the feature information of the key field 2 and the feature information of all the value fields can be combined and input into the classification model, the feature information of the key field 3 and the feature information of all the value fields can be combined and input into the classification model, and the corresponding pairing result is output.
It can be understood that, if the key fields and the value fields in the text sequence are in a one-to-one correspondence, then in the process of pairing each key field and each value field through the classification model, once a target key field and a target value field have been determined to be a relationship pair (that is, the relationship pair category to which the target key field and the target value field belong is the key-value pair category), the feature information of the target value field no longer needs to be input into the classification model when the value fields forming relationship pairs with other key fields are subsequently determined; only the feature information of the value fields other than the target value field is input, which is beneficial to reducing the calculation amount of the classification model and improving the pairing efficiency of the key fields and the value fields. For example, if all the key fields and value fields included in the text sequence are key field 1, key field 2, key field 3, value field 1, value field 2 and value field 3, and key field 1 and value field 2 have already been determined to be a relationship pair by the classification model, then when the value field forming a relationship pair with key field 2 is subsequently determined, the feature information of key field 2, value field 1 and value field 3 can be merged and input into the classification model without inputting the feature information of all the value fields.
Alternatively, when determining the value field that forms a relationship pair with the last of all the key fields, the value field that has not been paired with any key field may be directly determined as the value field forming a relationship pair with the last key field, without going through the classification model. For example, if all the key fields and value fields included in the text sequence are key field 1, key field 2, key field 3, value field 1, value field 2 and value field 3, and key field 1 and value field 2 have already been determined to be a relationship pair by the classification model and key field 2 and value field 3 have been determined to be a relationship pair, then key field 3 and value field 1 can be directly determined to be a relationship pair.
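For illustration, a Python sketch of this pairing procedure under the one-to-one assumption is given below; the binary matching function is_relation_pair stands in for the classification model and is an assumption.

```python
from typing import Callable, Dict, List

def pair_keys_and_values(key_fields: List[str], value_fields: List[str],
                         is_relation_pair: Callable[[str, str], bool]) -> Dict[str, str]:
    """Pair each key field with a value field under a one-to-one assumption.

    A matched value field is removed from the candidate set, and the last key
    field directly takes the only remaining value field without calling the model.
    """
    pairs: Dict[str, str] = {}
    candidates = list(value_fields)
    for i, key in enumerate(key_fields):
        if i == len(key_fields) - 1 and len(candidates) == 1:
            pairs[key] = candidates.pop(0)        # last key: no model call needed
            continue
        for value in candidates:
            if is_relation_pair(key, value):      # assumed binary classifier
                pairs[key] = value
                candidates.remove(value)
                break
    return pairs
```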
As another possible implementation, after determining the key field and the value field in the text sequence, the terminal device or the server may combine the key field and the value field two by two to obtain at least one key-value text sequence. Further, the feature information of the key field and the value field in each key value text sequence can be obtained, and the key field and the value field in each key value text sequence are paired according to the feature information to obtain a pairing result of the key field and the value field in each key value text sequence, wherein each key value text sequence comprises a key field and a value field, and the pairing result indicates the relationship pair type of the key field and the value field in each key value text sequence.
S204: and (5) structuring output. In a specific implementation, the terminal device or the server may determine, based on the indication of the pairing result, a target value field paired with each key field in the text sequence, and display each key field and a target value field corresponding to each key field in a page of the video processing platform in an associated manner. The relevance display may be, for example, displaying each key field and its corresponding target value field in the same row (e.g., as shown in the right diagram of fig. 1).
Based on the above description, the embodiment of the present application proposes yet another image processing method, which may be executed by the above-mentioned terminal device or server, please refer to fig. 6, and the image processing method may include the following steps S601-S605:
S601, converting the image to be processed into a text sequence. The image to be processed may be any of a plurality of business images, and may include any one of the following: a business license image, a value-added tax invoice image, an identity card image, or a social security card image. In one embodiment, a user may capture an image to be processed (e.g., turn on a camera to capture the image) or upload the image to be processed through the image processing platform, and trigger text recognition of the image to be processed. Illustratively, referring to fig. 7a, the image to be processed is a business license image; the user opens the camera through the image processing platform to take the business license image, the image processing platform may show the business license image taken by the user in a page (as shown in the right diagram of fig. 7a), and the user may trigger a "text recognition" function button in the page by clicking, pressing, or speaking, so as to trigger text recognition on the business license image.
Further, after detecting that the user triggers text recognition of the image to be processed, the terminal device or the server may call a text detection model to perform text recognition on the acquired image to be processed, and perform layout processing on the text recognition result to obtain the text sequence corresponding to the image to be processed. The text detection model may be, for example, an OCR text detection model (e.g., EAST (An Efficient and Accurate Scene Text Detector)) or another neural network model for text recognition.
The specific implementation of performing layout processing on the text recognition result to obtain the text sequence corresponding to the image to be processed may be: performing layout processing on the text recognition result and splicing the discrete characters to form a paragraph text (namely a text sequence). This is beneficial to quickly extracting effective information from the text sequence in the subsequent processing.
S602, performing key value classification on the text sequence, and determining the key field and the value field included in the text sequence based on the key value classification result. The value field is a named entity in the text sequence, and the key field is the text item corresponding to the named entity; a named entity is an entity identified by a name, such as a person name, an organization name or a place name, and in a broader sense can also include numbers, dates, currencies, addresses and the like. Illustratively, assuming that the image to be processed is the license image shown in fig. 1, "XX service limited company" in the license image is a named entity, and "name" is the text item corresponding to "XX service limited company".
In one embodiment, the named entity model or the position-based single-character classification model can be trained in advance with a large number of text sequences labeled with classification labels; after the image to be processed is converted into a text sequence, the trained named entity model or position-based single-character classification model can be called to perform key value classification on the text sequence, and a key value classification result containing the classification label of each character in the text sequence is output. For example, the named entity model may be the combined Bi-LSTM and CRF model shown in fig. 3; for a specific implementation of calling the named entity model to perform key value classification on the text sequence, refer to the related description of step S202 in the foregoing embodiment, which is not repeated here.
Wherein, the text sequence may include a plurality of fields, each field includes one or more characters, each classification label (as shown in table 2 above) is used to indicate the character type of the character and the position of the character in the field, and the position includes any one or more of the following: a start position, a middle position, and an end position, the character type including any one or more of: key characters, value characters, and other characters. In this case, after determining the key value classification result including the classification label of each character in the text sequence through the named entity model or the single character classification model based on the location, the characters of which the character types are key characters and which belong to the same field in the text sequence may be integrated into a key field according to the indication of the classification label of each character, and the characters of which the character types are value characters and which belong to the same field in the text sequence may be integrated into a value field.
Exemplarily, it is assumed that the content of the text sequence is "established date 2019", and the classification labels of the nine characters in the text sequence are respectively: "B-Key", "I-Key", "I-Key", "E-Key", "B-Value", "I-Value", "I-Value", "I-Value" and "E-Value". In this case, as indicated by the classification labels, the four characters of "established date" are all key characters belonging to the same field and can be integrated into the key field "established date"; accordingly, the characters "2", "0", "1", "9" and "year" are all value characters belonging to the same field and can be integrated into the value field "2019".
S603, combining the key fields and the value fields pairwise to obtain at least one key value text sequence, wherein each key value text sequence comprises one key field and one value field.
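For illustration, a one-step sketch of this pairwise combination is given below; the example field lists are assumptions.

```python
from itertools import product

key_fields = ["key field 1", "key field 2"]          # example fields (assumed)
value_fields = ["value field 1", "value field 2"]

# Each (key field, value field) combination forms one key value text sequence;
# 2 key fields x 2 value fields give the 4 sequences listed in Table 3 below.
kv_text_sequences = list(product(key_fields, value_fields))
```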
S604, acquiring characteristic information of the key field and the value field in each key value text sequence, and performing pairing processing on the key field and the value field in each key value text sequence according to the characteristic information.
In an embodiment, after the feature information of the key field and the value field in each key value text sequence is obtained, the feature information of the key field and the value field in each key value text sequence may be input into a matching model, the feature information is analyzed through the matching model, and the key field and the value field in each key value text sequence are paired to obtain a pairing result for the key field and the value field in each key value text sequence. The pairing result indicates the relationship pair category to which the key field and the value field in each key value text sequence belong, and the relationship pair category includes the key-value pair category or other categories; the matching model may be a classification model (e.g., random forest, linear regression, logistic regression, decision tree, SVM (Support Vector Machine), neural network, etc.) or a graph model (e.g., GCN (Graph Convolutional Network), etc.).
Taking a classification model as an example, for any key value text sequence, after the feature information of the key field and the value field in that key value text sequence is determined, the feature information of the key field and the value field can be merged and input into the classification model; the classification model can judge, based on this feature information, whether the key field and the value field in that key value text sequence are a relationship pair. If so, the relationship pair category to which the key field and the value field belong is determined to be the key-value pair category; if not, the relationship pair category to which they belong is determined to be one of the other categories. Further, a pairing result for the key field and the value field in that key value text sequence may be output based on the determined relationship pair category, where the pairing result indicates the relationship pair category to which the key field and the value field in that key value text sequence belong.
Exemplarily, it is assumed that all the key fields and value fields included in the text sequence are key field 1, key field 2, value field 1 and value field 2, and that each key field and all the value fields are combined pairwise to obtain 4 key value text sequences; the key field and the value field included in each key value text sequence are shown in table 3. After the feature information of the key field and the value field in each key value text sequence is determined, the feature information of key field 1 and value field 1 in key value text sequence 1 may first be merged and input into the classification model, and the classification model determines whether key field 1 and value field 1 in key value text sequence 1 are a relationship pair. If so, the relationship pair category to which key field 1 and value field 1 in key value text sequence 1 belong is determined to be the key-value pair category; if not, it is determined to be one of the other categories. A pairing result for key field 1 and value field 1 in key value text sequence 1 is then output based on the determined relationship pair category, and the pairing result indicates the relationship pair category to which key field 1 and value field 1 in key value text sequence 1 belong. By analogy, the feature information of key field 1 and value field 2 in key value text sequence 2, the feature information of key field 2 and value field 1 in key value text sequence 3, and the feature information of key field 2 and value field 2 in key value text sequence 4 can each be merged and input into the classification model, so that the pairing results for key value text sequences 2, 3 and 4 are obtained.
TABLE 3
Key value text sequence | Key field and value field included
Key value text sequence 1 | Key field 1 and value field 1
Key value text sequence 2 | Key field 1 and value field 2
Key value text sequence 3 | Key field 2 and value field 1
Key value text sequence 4 | Key field 2 and value field 2
The characteristic information may include any one or more of the following: semantic information, position information and attribute information of the key field and the value field in each key value text sequence. The attribute information is used for representing the field type of the key field and the value field in each key value text sequence, where the field type is the key field type or the value field type; the position information is used for representing the relative positions of the key field and the value field in each key value text sequence within the image to be processed, and includes the position coordinates of the key field and the value field in the image to be processed or their aspect ratios relative to the image to be processed.
It is understood that the key field and the value field generally have strong position correlation, and the position correlation mainly refers to the display position of the key field and the value field in the image to be processed, for example, the display position of the key field and the value field in the Chinese format in the image to be processed is generally: the key field is on the left and the value field is on the right (as shown in the left diagram of fig. 1), or the key field is above, the value field is below, and so on. In the embodiment of the application, the key fields and the value fields in the text sequence can be paired by combining the semantic information, the position information and the attribute information of the key fields and the value fields in each key value text sequence, so that the accuracy of the pairing result is further improved.
In an embodiment, the feature information includes semantic information, and a specific implementation manner of the terminal device or the server acquiring the semantic information of the key field and the value field in each key value text sequence may be as follows: and segmenting each key value text sequence according to the positions of the key fields and the value fields in each key value text sequence, and extracting the characteristics of each segmented key value text sequence through a semantic representation model to obtain the semantic information of the key fields and the value fields in each key value text sequence.
The above segmenting of each key value text sequence according to the positions of the key field and the value field in it may include: adding an input start flag bit, an input end flag bit, a start flag bit of the key field, an end flag bit of the key field, a start flag bit of the value field and an end flag bit of the value field to each key value text sequence according to the positions of the key field and the value field. In the subsequent process of extracting features from the segmented key value text sequences through the semantic representation model, the semantic representation model can thus pay more attention to the semantic information of the key field and the value field in each key value text sequence without being influenced by other fields, which improves the accuracy of the extracted semantic information.
Exemplarily, referring to fig. 8, it is assumed that a certain key value text sequence includes key field 1 and value field 2, the characteristic information includes semantic information, the semantic representation model may be Bert, and the input start flag bit, the input end flag bit, the start and end flag bits of the key field, and the start and end flag bits of the value field are respectively as shown in table 4. In this case, based on the positions of key field 1 and value field 2, the input start flag "[Beg]", the input end flag "[End]", the start flag "[E1]" and end flag "[/E1]" of key field 1, and the start flag "[E2]" and end flag "[/E2]" of value field 2 may be added to the key value text sequence; the key value text sequence with the flags added may then be input into Bert, and the semantic information of key field 1 and the semantic information of value field 2 may be extracted by Bert. Further, the semantic information of key field 1 and the semantic information of value field 2 may be input into the classification model, which analyzes them and pairs key field 1 and value field 2 to obtain a pairing result indicating the relationship pair category to which key field 1 and value field 2 belong, where the relationship pair category includes the key-value pair category (i.e., KV pair in fig. 8) and other categories (e.g., KK pair and others in fig. 8).
TABLE 4
Input start flag bit          [Beg]
Start flag bit of key field   [E1]
End flag bit of key field     [/E1]
Start flag bit of value field [E2]
End flag bit of value field   [/E2]
Input end flag bit            [End]
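The marker-based segmentation and feature extraction can be sketched briefly. The following is a minimal illustration only, not the patented implementation: it assumes a HuggingFace-style Chinese BERT checkpoint ("bert-base-chinese"), registers the flag bits of table 4 as additional special tokens, and assumes the key field precedes the value field in the sequence; all function names and spans are hypothetical.

    # Sketch only: insert the flag bits of table 4 around the key/value fields,
    # run a BERT encoder, and read out the hidden states at the start markers.
    import torch
    from transformers import BertTokenizer, BertModel

    MARKERS = ["[Beg]", "[End]", "[E1]", "[/E1]", "[E2]", "[/E2]"]

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})
    model = BertModel.from_pretrained("bert-base-chinese")
    model.resize_token_embeddings(len(tokenizer))  # make room for the new markers

    def encode_key_value_sequence(text, key_span, value_span):
        """Add the flag bits around the key field and the value field
        (character offsets; the key field is assumed to come first), then
        return the BERT hidden states at the [E1] and [E2] markers."""
        (ks, ke), (vs, ve) = key_span, value_span
        marked = ("[Beg]" + text[:ks] + "[E1]" + text[ks:ke] + "[/E1]"
                  + text[ke:vs] + "[E2]" + text[vs:ve] + "[/E2]"
                  + text[ve:] + "[End]")
        inputs = tokenizer(marked, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        ids = inputs["input_ids"][0].tolist()
        key_vec = hidden[ids.index(tokenizer.convert_tokens_to_ids("[E1]"))]
        value_vec = hidden[ids.index(tokenizer.convert_tokens_to_ids("[E2]"))]
        return key_vec, value_vec  # semantic information of key field / value field

A classification model, for example a linear layer over the concatenation of the two returned vectors, can then predict the relationship pair category (KV pair vs. other), as illustrated in fig. 8.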
In one embodiment, the image to be processed may be placed in a planar rectangular coordinate system for analysis, and the feature information includes position information. The position information is used for characterizing the relative positions, in the image to be processed, of the key field and the value field in each key value text sequence, and includes the position coordinates of the key field and the value field in each key value text sequence in the image to be processed; a position coordinate consists of a horizontal coordinate (i.e., an x-axis coordinate in the rectangular coordinate system) and a vertical coordinate (i.e., a y-axis coordinate in the rectangular coordinate system). In this case, the terminal device or the server invokes the text detection model to perform text recognition on the acquired image to be processed, and the obtained text recognition result includes not only the character string extracted from the image to be processed but also the position coordinate, in the image to be processed, of each character in the character string. Further, after all the key fields and value fields included in the text sequence have been determined, the position coordinates of each character in each key field and in each value field may be obtained from the text recognition result, and the position coordinates of each key field and of each value field in the image to be processed may then be determined from the position coordinates of their respective characters.
The position coordinates of each key field in the image to be processed may be determined from the position coordinates of its characters in any one or more of the following ways: the position coordinate of the first character of the key field is taken as the position coordinate of the key field, the position coordinate of the last character is taken as the position coordinate of the key field, or the position coordinate of the center point of the key field is taken as the position coordinate of the key field. For example, suppose a key field "date of establishment" includes 4 characters whose position coordinates are (m1, n), (m2, n), (m3, n) and (m4, n); the position coordinate of the center point of this key field may then be ((m1+m2+m3+m4)/4, n). Similarly, the position coordinates of each value field in the image to be processed can be determined in the same manner as for the key fields, which is not described again here.
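As a simple sketch of these three alternatives (a hypothetical helper; each character is assumed to be represented by a single (x, y) coordinate pair produced by the text detection model):

    def field_position(char_coords, rule="center"):
        """char_coords: list of (x, y), one pair per character of the field.
        Returns a single position coordinate for the whole field."""
        if rule == "first":
            return char_coords[0]
        if rule == "last":
            return char_coords[-1]
        # center point: mean of the abscissas, ordinate of the shared text line
        xs = [x for x, _ in char_coords]
        return (sum(xs) / len(xs), char_coords[0][1])

    # The "date of establishment" example above:
    # field_position([(m1, n), (m2, n), (m3, n), (m4, n)]) == ((m1+m2+m3+m4)/4, n)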
Further, after the position coordinates of all the key fields and value fields included in the text sequence have been determined, they may be stored in a designated storage area (e.g., a local storage area of the terminal device or the server, a blockchain, or a cloud storage area). The terminal device or the server may subsequently obtain, from the designated storage area, the position coordinates of the key field and the value field in each key value text sequence in the image to be processed as the position information of the key field and the value field in that key value text sequence.
Alternatively, the position information further includes the aspect ratio of the key field and the value field in each key value text sequence relative to the image to be processed, i.e., the ratios of the field's width and height to the width and height of the image. In another embodiment, after acquiring the position coordinates of each character in the key field and in the value field from the text recognition result, the terminal device or the server may also acquire the width w (w > 0) and the height h (h > 0) of the image to be processed, determine the width x0 and the height y0 of the key field from the position coordinates of the characters in the key field, and determine the width x1 and the height y1 of the value field from the position coordinates of the characters in the value field. The ratios of the key field relative to the image to be processed are then x0/w and y0/h, and the ratios of the value field relative to the image to be processed are x1/w and y1/h. Further, the ratios of all the key fields and value fields included in the text sequence relative to the image to be processed may be stored in the above-mentioned designated storage area. The terminal device or the server may subsequently obtain these ratios for the key field and the value field in each key value text sequence from the designated storage area as their position information.
Exemplarily, referring to fig. 9, assuming that the width of the image to be processed is w and the height is h, K of "Kxy" in fig. 9 represents the key field, and the subscript "xy" represents the width and height of the corresponding key field in the image to be processed, respectively; in fig. 9, V of "Vxy" represents a value field, and subscript "xy" represents the width and height of the corresponding value field in the image to be processed, respectively. It can be seen that the terminal device or the server may determine the aspect ratio of all the key fields and value fields in the text sequence with respect to the image to be processed according to the width and height of all the key fields and value fields included in the text sequence in the image to be processed, and the width w and height h of the image to be processed.
The width x0 and the height y0 of a key field may be determined from the position coordinates of its characters as follows: the difference between the abscissas of the last character and the first character of the key field is taken as the width x0, and the ordinate of any character of the key field is taken as the height y0. The width x1 and the height y1 of a value field may be determined from the position coordinates of its characters in a similar manner, which is not described again here.
For example, assuming that coordinates are expressed in centimetres and the key field "name" consists of two characters, with the first character at (4, 2) and the second character at (6, 2), the width of the key field is determined to be 2 cm and its height 2 cm.
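The width/height rule just described can be written down directly. The sketch below implements exactly that rule (width as the abscissa difference between last and first character, height as the ordinate of any character); the helper names and the image size in the comment are hypothetical.

    def field_size(char_coords):
        """Width and height of a field under the rule described above."""
        width = char_coords[-1][0] - char_coords[0][0]   # last minus first abscissa
        height = char_coords[0][1]                       # ordinate of any character
        return width, height

    def relative_size(char_coords, image_w, image_h):
        """Width ratio and height ratio of the field relative to the image."""
        w_field, h_field = field_size(char_coords)
        return w_field / image_w, h_field / image_h

    # The "name" example above: characters at (4, 2) and (6, 2) give a width of
    # 2 cm and a height of 2 cm; in a (hypothetical) 20 cm x 10 cm image the
    # ratios would be 0.1 and 0.2.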
S605: outputting a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence. The structured text refers to the key fields, together with the target value fields paired with them, displayed according to a certain display rule or display manner.
In one embodiment, the pairing result indicates a relationship pair category to which the key field and the value field belong in each key value text sequence, and the relationship pair category includes a key value pair category or other categories. The terminal device or the server may determine, according to an indication of a pairing result of the key field and the value field in each key value text sequence, a target value field paired with each key field in the text sequence, where the target value field is a value field whose relation pair category to which the corresponding key field belongs in the text sequence is a key value pair category. Further, each key field and the target value field paired with each key field may be displayed according to a display rule.
Exemplarily, assume that all the key fields and value fields included in a text sequence are key field 1, key field 2, value field 1 and value field 2, and that each key field is combined pairwise with every value field to obtain 4 key value text sequences whose key fields and value fields are as shown in table 3. Suppose the pairing results of the key field and the value field in the 4 key value text sequences respectively indicate that: key field 1 and value field 1 belong to the key value pair category; key field 1 and value field 2 belong to other categories; key field 2 and value field 1 belong to other categories; and key field 2 and value field 2 belong to the key value pair category. In this case, the terminal device or the server may determine that the target value field paired with key field 1 in the text sequence is value field 1, and that the target value field paired with key field 2 is value field 2.
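The selection of target value fields from such pairing results can be sketched as follows (a minimal illustration; the field names and the "KV" label are placeholders standing in for the matching model's output, not values prescribed by the patent):

    def build_structured_text(pair_predictions):
        """pair_predictions: one (key_field, value_field, category) triple per
        key value text sequence; category is "KV" for the key value pair
        category and anything else ("KK", "other", ...) otherwise."""
        structured = {}
        for key_field, value_field, category in pair_predictions:
            if category == "KV":  # this value field is the target value field
                structured[key_field] = value_field
        return structured

    # The example above:
    # build_structured_text([("key field 1", "value field 1", "KV"),
    #                        ("key field 1", "value field 2", "other"),
    #                        ("key field 2", "value field 1", "other"),
    #                        ("key field 2", "value field 2", "KV")])
    # == {"key field 1": "value field 1", "key field 2": "value field 2"}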
Here, a target key field denotes any one of the key fields, and the target value field is the value field paired with that key field. The display rule may include displaying the target key field and its target value field in the same row; for example, assuming that the pairing of the key fields and value fields in the text sequence is as shown in table 5, the effect of displaying each key field and the target value field paired with it according to this display rule may be as shown in the right diagram of fig. 1.
TABLE 5
Key field                    Paired value field
Unified social credit code   91440300MA3EL54E2H
Legal representative         Li X
Name                         XX Service Co., Ltd.
Residence                    XXX, Shentian District, Shenzhen
Type of entity               Limited liability company (wholly owned by a natural person)
Date of establishment        26/06/2017
Alternatively, the display rule may include displaying the target key field and the target value field in adjacent rows, with the display row of the target key field located before the display row of the target value field. For example, assuming that the pairing of the key fields and value fields in the text sequence is as shown in table 5, the effect of displaying each key field and the target value field paired with it according to this display rule may be as shown in fig. 7 b.
In another embodiment, after the target value field paired with each key field in the text sequence is determined based on the indication of the pairing result, the display manner of each key field and value field in the image to be processed may be determined based on their position information in the image to be processed, and each key field and the target value field paired with it may be displayed according to that display manner. For example, referring to fig. 7c, the image to be processed is a license image, and each key field and the target value field paired with it may be displayed in a page of the image processing platform according to the display manner of the fields in the image to be processed; the display effect is shown in the right diagram of fig. 7 c. It can be seen that the display of the key fields and value fields in the page of the image processing platform is consistent with their display in the license image. In this way, the user can conveniently and quickly locate the required target information in the output structured text, which improves the efficiency of acquiring the target information.
In the embodiment of the application, the image to be processed can be converted into a text sequence, key value classification is performed on the text sequence, and the key fields and value fields included in the text sequence are determined based on the key value classification result. Further, the key fields and the value fields can be combined pairwise to obtain at least one key value text sequence, characteristic information of the key field and the value field in each key value text sequence is obtained, the key field and the value field in each key value text sequence are paired according to the characteristic information, and the structured text corresponding to the image to be processed is then output based on the pairing results, thereby achieving the conversion from image data to structured data. On one hand, the method does not depend on a template image or a dedicated text field detector, and the accuracy of the output result is not affected by format changes of the image to be processed, so the corresponding structured data can be extracted more accurately from images with non-fixed formats; the method is thus suitable for structuring scenarios involving various images with non-fixed formats, which is beneficial to expanding the application range. On the other hand, each time the characteristic information is acquired and pairing is performed, the objects are only the key field and the value field in one key value text sequence, without interference from other field information, which improves the accuracy of the pairing result between each key field and each value field and further improves the accuracy of extracting the corresponding structured data from the image to be processed.
The embodiment of the present application further provides a computer storage medium, in which program instructions are stored, and when the program instructions are executed, the computer storage medium is used for implementing the corresponding method described in the above embodiment.
Referring to fig. 10, a schematic structural diagram of an image processing apparatus according to an embodiment of the present application is shown, and the image processing apparatus according to the embodiment of the present application may be provided in the terminal device, or may be a computer program (including program codes) running in the terminal device.
In one implementation of the apparatus of the embodiment of the application, the apparatus includes the following structure.
A conversion unit 80 for converting the image to be processed into a text sequence;
the processing unit 81 is configured to perform key value classification on the text sequences, determine key fields and value fields included in the text sequences based on the key value classification result, combine the key fields and the value fields in pairs to obtain at least one key value text sequence, obtain feature information of the key fields and the value fields in each key value text sequence, and perform pairing processing on the key fields and the value fields in each key value text sequence according to the feature information;
and the output unit 82 is used for outputting the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
In one embodiment, the characteristic information includes any one or more of: the method comprises the steps of semantic information, position information and attribute information of key fields and value fields in each key value text sequence, wherein the attribute information is used for representing field types of the key fields and the value fields in each key value text sequence, the field types comprise key field types or value field types, the position information is used for representing the relative positions of the key fields and the value fields in each key value text sequence in an image to be processed, and the position information comprises position coordinates of the key fields and the value fields in each key value text sequence in the image to be processed or the aspect ratio of the key fields and the value fields relative to the image to be processed.
In an embodiment, the feature information includes semantic information, and the processing unit 81 is specifically configured to perform segmentation processing on each key value text sequence according to positions of a key field and a value field in each key value text sequence; and performing feature extraction on each segmented key value text sequence through a semantic representation model to obtain semantic information of a key field and a value field in each key value text sequence.
In one embodiment, the processing unit 81 is further specifically configured to add an input start flag bit, an input end flag bit, a start flag bit of a key field, an end flag bit of a key field, a start flag bit of a value field, and an end flag bit of a value field in each key value text sequence according to the positions of the key field and the value field in each key value text sequence.
In an embodiment, the matching process is performed by calling a matching model, the matching result indicates a relationship pair category to which the key field and the value field in each key value text sequence belong, the relationship pair category includes a key value pair category or other categories, the output unit 82 is specifically configured to determine, according to an indication of the matching result of the key field and the value field in each key value text sequence, a target value field that is paired with each key field in the text sequence, and the target value field is a value field in the text sequence whose relationship pair category to which the corresponding key field belongs is the key value pair category; each key field and the target value field paired with each key field are displayed according to the display rule.
In one embodiment, the text sequence contains a plurality of fields, each field including one or more characters; the key value classification result comprises classification labels of all characters in the text sequence, and the classification labels are used for indicating the character types of the characters and the positions of the characters in the fields to which the characters belong; the location includes any one or more of: a start position, an intermediate position, and an end position; character types include any one or more of the following: key characters, value characters, and other characters.
In one embodiment, the processing unit 81 is specifically configured to integrate the characters in the text sequence, which are of the character type of key character and belong to the same field, into a key field and integrate the characters in the text sequence, which are of the character type of value character and belong to the same field, into a value field according to the indication of the classification label of each character.
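This character-merging step can be sketched with a BIO-style tag set; the tag names (B-KEY, I-KEY, B-VAL, I-VAL, O) are an assumption standing in for the start/intermediate/end positions and character types described above, not labels prescribed by the patent.

    def merge_fields(chars, labels):
        """chars: one character per entry; labels: one tag per character,
        e.g. "B-KEY", "I-KEY", "B-VAL", "I-VAL" or "O" (assumed tag set).
        Adjacent characters of the same type belonging to the same field
        are merged into a key field or a value field."""
        key_fields, value_fields = [], []
        current, current_type = "", None

        def flush():
            nonlocal current, current_type
            if current and current_type == "KEY":
                key_fields.append(current)
            elif current and current_type == "VAL":
                value_fields.append(current)
            current, current_type = "", None

        for ch, tag in zip(chars, labels):
            if tag.startswith("B-"):          # start of a new key/value field
                flush()
                current, current_type = ch, tag[2:]
            elif tag.startswith("I-") and tag[2:] == current_type:
                current += ch                 # intermediate/end character
            else:                             # "O" or inconsistent tag
                flush()
        flush()
        return key_fields, value_fields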
In one embodiment, key value classification is performed by calling a named entity model or a single word classification model based on position, a value field is a named entity in a text sequence, and a key field is a text item corresponding to the named entity.
In an embodiment, the conversion unit 80 is specifically configured to invoke a text detection model to perform text recognition on the acquired image to be processed, and perform typesetting on the text recognition result to obtain a text sequence corresponding to the image to be processed.
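The typesetting step is not spelled out in detail; as an illustration only, the sketch below assumes the text detection model returns one (text, x, y) triple per recognized line and restores a simple top-to-bottom, left-to-right reading order before concatenating the lines into one text sequence.

    def typeset(ocr_lines, line_tolerance=10):
        """ocr_lines: list of (text, x, y), where (x, y) is the top-left corner
        of a recognized line (assumption). Groups lines into rows by y, sorts
        each row by x, and concatenates everything into one text sequence."""
        ocr_lines = sorted(ocr_lines, key=lambda item: (item[2], item[1]))
        rows, current_row, current_y = [], [], None
        for text, x, y in ocr_lines:
            if current_y is None or abs(y - current_y) <= line_tolerance:
                current_row.append((x, text))
                current_y = y if current_y is None else current_y
            else:
                rows.append(current_row)
                current_row, current_y = [(x, text)], y
        if current_row:
            rows.append(current_row)
        return "".join(text for row in rows for _, text in sorted(row))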
In one embodiment, the image to be processed includes any one of: a business license image, a value-added invoice image, an identification card image, or a social security card image.
In the embodiment of the present application, the detailed implementation of the above units can refer to the description of relevant contents in the embodiments corresponding to the foregoing drawings.
The image processing device in the embodiment of the application can convert the image to be processed into the text sequence, perform key value classification on the text sequence, and determine the key field and the value field included in the text sequence based on the key value classification result. Further, the key fields and the value fields can be combined pairwise to obtain at least one key value text sequence, characteristic information of the key fields and the value fields in each key value text sequence is obtained, the key fields and the value fields in each key value text sequence are paired according to the characteristic information, then the structured text corresponding to the image to be processed is output based on the pairing result of the key fields and the value fields in each key value text sequence, and conversion from image data to structured data is achieved. The method does not depend on a template image or a special text field detector, the accuracy of an output result is not influenced by the format change of the image to be processed, corresponding structured data can be more accurately extracted from the image with the non-fixed format, the method is suitable for the structured scenes of various images with the non-fixed format, and the method is favorable for expanding the application range.
Referring to fig. 11 again, it is a schematic structural diagram of a terminal device according to an embodiment of the present application, where the terminal device according to the embodiment of the present application includes a power supply module and the like, and includes a processor 90, a storage device 91, an input device 92, and an output device 93. Data can be exchanged among the processor 90, the storage 91, the input device 92 and the output device 93, and the processor 90 realizes corresponding image processing functions.
The storage device 91 may include a volatile memory (volatile memory), such as a random-access memory (RAM); the storage device 91 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), or the like; the storage means 91 may also comprise a combination of memories of the kind described above.
The processor 90 may be a Central Processing Unit (CPU) 90. In one embodiment, processor 90 may also be a Graphics Processing Unit (GPU) 90. The processor 90 may also be a combination of a CPU and a GPU. In the terminal device, a plurality of CPUs and GPUs may be included as necessary to perform corresponding image processing.
The input device 92 may include a touch pad, fingerprint sensor, microphone, etc., and the output device 93 may include a display (LCD, etc.), speaker, etc.
In one embodiment, storage device 91 is used to store program instructions. The processor 90 may invoke program instructions to implement the various methods as described above in the embodiments of the present application.
In a first possible implementation manner, the processor 90 of the terminal device invokes a program instruction stored in the storage device 91, and is configured to convert an image to be processed into a text sequence, perform key value classification on the text sequence, determine a key field and a value field included in the text sequence based on a result of the key value classification, combine the key field and the value field two by two to obtain at least one key value text sequence, obtain feature information of the key field and the value field in each key value text sequence, perform pairing processing on the key field and the value field in each key value text sequence according to the feature information, and output a structured text corresponding to the image to be processed based on a result of pairing the key field and the value field in each key value text sequence.
In one embodiment, the characteristic information includes any one or more of: the method comprises the steps of semantic information, position information and attribute information of key fields and value fields in each key value text sequence, wherein the attribute information is used for representing field types of the key fields and the value fields in each key value text sequence, the field types comprise key field types or value field types, the position information is used for representing the relative positions of the key fields and the value fields in each key value text sequence in an image to be processed, and the position information comprises position coordinates of the key fields and the value fields in each key value text sequence in the image to be processed or the aspect ratio of the key fields and the value fields relative to the image to be processed.
In one embodiment, the feature information includes semantic information, and the processor 90 is specifically configured to segment each key value text sequence according to positions of a key field and a value field in each key value text sequence; and performing feature extraction on each segmented key value text sequence through a semantic representation model to obtain semantic information of a key field and a value field in each key value text sequence.
In one embodiment, the processor 90 is further specifically configured to add an input start flag, an input end flag, a start flag of a key field, an end flag of a key field, a start flag of a value field, and an end flag of a value field in each key value text sequence according to the positions of the key field and the value field in each key value text sequence.
In an embodiment, the matching process is performed by calling a matching model, the matching result indicates a relationship pair category to which the key field and the value field in each key value text sequence belong, and the relationship pair category includes a key value pair category or other categories, and the processor 90 is further specifically configured to determine, according to the indication of the matching result of the key field and the value field in each key value text sequence, a target value field to be matched with each key field in the text sequence, where the target value field is a value field in the text sequence whose relationship pair category to which the corresponding key field belongs is the key value pair category; each key field and a target value field paired with each key field are displayed by the output device 93 in accordance with a display rule.
In one embodiment, the text sequence contains a plurality of fields, each field including one or more characters; the key value classification result comprises classification labels of all characters in the text sequence, and the classification labels are used for indicating the character types of the characters and the positions of the characters in the fields to which the characters belong; the location includes any one or more of: a start position, an intermediate position, and an end position; character types include any one or more of the following: key characters, value characters, and other characters.
In one embodiment, the processor 90 is specifically configured to integrate the characters in the text sequence, which are of the character type of key character and belong to the same field, into a key field and integrate the characters in the text sequence, which are of the character type of value character and belong to the same field, into a value field, as indicated by the classification label of each character.
In one embodiment, key value classification is performed by calling a named entity model or a single word classification model based on position, a value field is a named entity in a text sequence, and a key field is a text item corresponding to the named entity.
In an embodiment, the processor 90 is further specifically configured to invoke a text detection model to perform text recognition on the acquired image to be processed, and perform layout processing on the text recognition result to obtain a text sequence corresponding to the image to be processed.
In one embodiment, the image to be processed includes any one of: a business license image, a value-added invoice image, an identification card image, or a social security card image.
In the embodiment of the present application, the specific implementation of the processor 90 may refer to the description related to the embodiments corresponding to the foregoing drawings.
The terminal device in the embodiment of the application can convert the image to be processed into the text sequence, perform key value classification on the text sequence, and determine the key field and the value field included in the text sequence based on the key value classification result. Further, the key fields and the value fields can be combined pairwise to obtain at least one key value text sequence, characteristic information of the key fields and the value fields in each key value text sequence is obtained, pairing processing is performed on the key fields and the value fields in each key value text sequence according to the characteristic information, then a structured text corresponding to the image to be processed is output based on the pairing result of the key fields and the value fields in each key value text sequence, and conversion from the image data to the structured data is achieved. The method does not depend on a template image or a special text field detector, the accuracy of an output result is not influenced by the format change of the image to be processed, corresponding structured data can be more accurately extracted from the image with the non-fixed format, the method is suitable for the structured scenes of various images with the non-fixed format, and the method is favorable for expanding the application range.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. An image processing method, characterized in that the method comprises:
converting an image to be processed into a text sequence;
performing key value classification on the text sequence, and determining a key field and a value field included in the text sequence based on a key value classification result;
combining the key fields and the value fields pairwise to obtain at least one key value text sequence, wherein each key value text sequence comprises one key field and one value field;
acquiring characteristic information of key fields and value fields in each key value text sequence;
pairing the key fields and the value fields in each key value text sequence according to the characteristic information;
and outputting the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
2. The method of claim 1, wherein the characteristic information comprises any one or more of: the image processing method comprises the steps of obtaining semantic information, position information and attribute information of key fields and value fields in each key value text sequence, wherein the attribute information is used for representing field types of the key fields and the value fields in each key value text sequence, the field types comprise key field types or value field types, the position information is used for representing relative positions of the key fields and the value fields in each key value text sequence in the image to be processed, and the position information comprises position coordinates of the key fields and the value fields in each key value text sequence in the image to be processed or aspect ratio of the key fields and the value fields relative to the image to be processed.
3. The method of claim 1, wherein the feature information includes the semantic information, and wherein the obtaining the feature information of the key field and the value field in each key value text sequence includes:
performing segmentation processing on each key value text sequence according to the positions of the key field and the value field in each key value text sequence;
and performing feature extraction on each segmented key value text sequence through a semantic representation model to obtain semantic information of a key field and a value field in each key value text sequence.
4. The method of claim 2, wherein said slicing each key-value text sequence according to the position of the key field and the value field in said each key-value text sequence comprises:
and adding an input starting zone bit, an input ending zone bit, a starting zone bit of the key field, an ending zone bit of the key field, a starting zone bit of the value field and an ending zone bit of the value field in each key value text sequence according to the positions of the key field and the value field in each key value text sequence.
5. The method of claim 1, wherein the pairing process is performed by calling a matching model, the pairing result indicates a relationship pair category to which a key field and a value field in each key value text sequence belong, the relationship pair category includes a key value pair category or other categories, and the outputting the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence includes:
according to the indication of the pairing result of the key field and the value field in each key value text sequence, determining a target value field paired with each key field in the text sequence, wherein the target value field is a value field of which the relation pair type to which the corresponding key field belongs in the text sequence is the key value pair type;
and displaying the key fields and the target value fields paired with the key fields according to a display rule.
6. The method of claim 1, wherein the text sequence contains a plurality of fields, each field comprising one or more characters;
the key value classification result comprises classification labels of all characters in the text sequence, and the classification labels are used for indicating the character types of the characters and the positions of the characters in the fields to which the characters belong; the location includes any one or more of: a start position, an intermediate position, and an end position; the character types include any one or more of: key characters, value characters, and other characters.
7. The method of claim 2, wherein the determining, based on the key-value classification result, that the text sequence includes a key field and a value field comprises:
and integrating characters of which the character types are key characters and which belong to the same field in the text sequence into a key field according to the indication of the classification label of each character, and integrating characters of which the character types are value characters and which belong to the same field in the text sequence into a value field.
8. The method of claim 1, wherein the key value classification is performed by calling a named entity model or a location-based single-word classification model, wherein the value field is a named entity in the text sequence, and the key field is a text item corresponding to the named entity.
9. The method of claim 1, wherein converting the image to be processed into a text sequence comprises:
calling a text detection model to perform text recognition on the acquired image to be processed;
and typesetting the text recognition result to obtain a text sequence corresponding to the image to be processed.
10. An image processing apparatus, characterized in that the apparatus comprises:
the conversion unit is used for converting the image to be processed into a text sequence;
the processing unit is used for carrying out key value classification on the text sequences, determining key fields and value fields included in the text sequences based on key value classification results, combining the key fields and the value fields in pairs to obtain at least one key value text sequence, acquiring characteristic information of the key fields and the value fields in each key value text sequence, and carrying out pairing processing on the key fields and the value fields in each key value text sequence according to the characteristic information;
and the output unit is used for outputting the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
11. A terminal device, characterized in that the terminal device comprises a processor and a storage means, the processor and the storage means being interconnected, wherein the storage means is adapted to store a computer program, the computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method according to any one of claims 1 to 9.
12. A computer storage medium having stored thereon program instructions for implementing a method according to any one of claims 1 to 9 when executed.