CN112528894B - Method and device for discriminating difference term - Google Patents

Method and device for discriminating difference term

Info

Publication number: CN112528894B (granted publication of application CN202011496118.XA)
Authority: CN (China)
Prior art keywords: difference, text, item, target, probability
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN112528894A
Inventors: 王亚利, 宋时德, 唐刘建, 庄纪军
Current and original assignee: iFlytek Co Ltd
Events: application filed by iFlytek Co Ltd; priority to CN202011496118.XA; publication of CN112528894A; application granted; publication of CN112528894B

Classifications

    • G06V 30/418: Document matching, e.g. of document images (under G06V 30/00, Character recognition; document-oriented image-based pattern recognition)
    • G06F 18/22: Matching criteria, e.g. proximity measures (pattern recognition)
    • G06F 18/24: Classification techniques (pattern recognition)
    • G06F 18/253: Fusion techniques of extracted features (pattern recognition)
    • G06V 30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a difference item discrimination method and device. The method comprises: acquiring a target difference item between the recognition result of a first single sentence and the recognition result of a second single sentence, wherein the target difference item comprises a first difference text and a second difference text, the first single sentence comprises a common item and the first difference text, and the second single sentence comprises the common item and the second difference text; determining a first probability corresponding to the first difference text and a second probability corresponding to the second difference text based on a language prediction model and the common item; and judging whether the target difference item is a real difference item according to the first probability and the second probability. By implementing the method and device, real difference items can be effectively identified and non-real difference items caused by OCR recognition errors can be filtered out, thereby improving the accuracy of single-sentence comparison.

Description

Method and device for discriminating difference term
Technical Field
The present application relates to the field of computers, and in particular, to a method and apparatus for determining a difference term.
Background
As business progresses, a large number of documents are generated, such as contract documents, bidding documents, and the like, and modification of the same document may produce multiple versions. For example, for the same contract file there may be one original document and several modified versions. To determine which modifications a modified version makes relative to another document, the contents of the two associated documents need to be compared to determine the difference items.
Typically, the two documents to be compared are scanned and recognized using optical character recognition (OCR), and their contents are then compared based on the recognition results output by OCR to determine the difference items. However, the OCR recognition process may be erroneous due to noise interference (e.g., watermarks or signatures), so some difference items determined from the OCR recognition results are not real difference items, resulting in low accuracy of the document comparison result.
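The comparison step described above can be sketched with the standard-library `difflib`; this is only an illustration of how an OCR misread produces a spurious difference item, not the patent's own comparison algorithm. The example sentences are hypothetical.

```python
from difflib import SequenceMatcher

def find_difference_items(sent_a: str, sent_b: str):
    """Return (text_a, text_b) pairs where the two recognition results differ."""
    matcher = SequenceMatcher(None, sent_a, sent_b)
    return [(sent_a[i1:i2], sent_b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# An OCR misread (e.g. "乙" recognized as "已") yields a difference item
# even though the underlying documents are identical at this position.
diffs = find_difference_items("甲方应向乙方支付", "甲方应向已方支付")
print(diffs)  # [('乙', '已')]
```

Such a pair is a candidate target difference item; the method below decides whether it is real.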
Disclosure of Invention
The application discloses a difference item discrimination method and device, which can effectively identify real difference items and filter out non-real difference items in a document comparison result, thereby improving the accuracy of document comparison.
In a first aspect, the present application provides a method for discriminating a difference item, the method comprising: acquiring a target difference item between a recognition result of a first single sentence and a recognition result of a second single sentence, wherein the target difference item comprises a first difference text and a second difference text, the first single sentence comprises a common item and the first difference text, the second single sentence comprises the common item and the second difference text, and the length of the first difference text is the same as that of the second difference text; determining a first probability corresponding to the first difference text and a second probability corresponding to the second difference text based on a language prediction model and the common item; and judging whether the target difference item is a real difference item according to the first probability and the second probability.
According to the method, for a target difference item determined after document comparison (comprising a first difference text and a second difference text), the language prediction model predicts, based on the common item of the two single sentences corresponding to the target difference item, the first probability that the first difference text appears in the first single sentence and the second probability that the second difference text appears in the second single sentence. Whether the target difference item is a real difference item is then determined according to the first probability and the second probability, so that real difference items can be effectively identified and the accuracy of the single-sentence comparison result is improved.
In a possible implementation manner of the first aspect, when the length of the first difference text differs from the length of the second difference text and the first difference text includes a semantically irrelevant word, the method further includes: removing the semantically irrelevant word from the first difference text so that the length of the first difference text is the same as the length of the second difference text.
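A minimal sketch of this length-alignment step, assuming a tiny hypothetical stop-word set; a production system would use a curated lexicon of semantically irrelevant words:

```python
# Hypothetical stop-word set standing in for the semantically irrelevant lexicon.
SEMANTICALLY_IRRELEVANT = {"的", "了", "之"}

def align_lengths(text_a: str, text_b: str):
    """If text_a is longer, drop semantically irrelevant characters so the
    two difference texts have equal length (when that makes them equal)."""
    if len(text_a) > len(text_b):
        stripped = "".join(ch for ch in text_a if ch not in SEMANTICALLY_IRRELEVANT)
        if len(stripped) == len(text_b):
            return stripped, text_b
    return text_a, text_b

print(align_lengths("支付的款项", "支付款项"))  # ('支付款项', '支付款项')
```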
In a possible implementation manner of the first aspect, determining whether the target difference item is a real difference item according to the first probability and the second probability includes: judging the target difference item to be a real difference item when the first probability is greater than or equal to a first threshold and the second probability is also greater than or equal to the first threshold.
In this implementation, a first probability greater than or equal to the first threshold means that the first difference text is recognized correctly, and a second probability greater than or equal to the first threshold means that the second difference text is recognized correctly. When both difference texts are recognized correctly and are nevertheless different, the target difference item can be judged to be a real difference item, thereby accurately identifying real difference items and improving the accuracy of document comparison results.
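The decision rule above reduces to a one-line predicate; the threshold value here is illustrative, not one specified by the patent:

```python
def is_true_difference(p1: float, p2: float, threshold: float = 0.5) -> bool:
    """Both difference texts are judged correctly recognized, so the
    difference between them is taken to be a real one."""
    return p1 >= threshold and p2 >= threshold

print(is_true_difference(0.8, 0.9))  # True
print(is_true_difference(0.8, 0.1))  # False: further checks are needed
```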
In a possible implementation manner of the first aspect, determining the first probability corresponding to the first difference text and the second probability corresponding to the second difference text based on the language prediction model and the common item includes: obtaining a target sentence corresponding to the target difference item, wherein the target sentence comprises the common item and a masked item; inputting the target sentence into the language prediction model to predict the masked item, and outputting a candidate word list corresponding to the masked item, wherein the candidate word list comprises a plurality of prediction results and their corresponding probabilities; and acquiring the first probability corresponding to the first difference text and the second probability corresponding to the second difference text from the candidate word list.
In this implementation, the language prediction model predicts the masked item based on the context provided by the common item of the target sentence, yielding a candidate word list comprising a plurality of prediction results and their corresponding probabilities. The probability of a prediction result indicates how likely that result is to appear at the masked position, i.e., how well it combines with the common item of the target sentence into a semantically well-formed expression. The first probability corresponding to the first difference text and the second probability corresponding to the second difference text can therefore be looked up in the candidate word list, which improves the recognition rate of real difference items.
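The lookup step can be sketched as follows. The candidate word list here is mocked as a plain dictionary; in practice it would come from a masked language model (such as BERT, described later) predicting the masked position, and words absent from the list get probability 0:

```python
def lookup_probabilities(candidates: dict, diff_a: str, diff_b: str):
    """Look up each difference text in the candidate word list produced by
    the language model for the masked position; unseen words get 0.0."""
    return candidates.get(diff_a, 0.0), candidates.get(diff_b, 0.0)

# Mocked candidate list for the masked position in "甲方应向[MASK]方支付".
candidates = {"乙": 0.92, "对": 0.05, "各": 0.02}
p1, p2 = lookup_probabilities(candidates, "乙", "已")
print(p1, p2)  # 0.92 0.0 -> "已" is likely an OCR error at this position
```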
In a possible implementation manner of the first aspect, when the first probability is smaller than the first threshold or the second probability is smaller than the first threshold, the method further includes: acquiring first data corresponding to the first difference text and second data corresponding to the second difference text, wherein the first data comprises first information corresponding to the first single sentence and a first image of the first single sentence, and the second data comprises second information corresponding to the second single sentence and a second image of the second single sentence; the first information comprises the word vector, position code and position information of each word in the first single sentence, and the second information comprises the word vector, position code and position information of each word in the second single sentence; and judging whether the target difference item is a real difference item based on the first data and the second data by using a difference item discrimination model.
In this implementation, if it cannot be determined that the first difference text or the second difference text is recognized correctly, the first data corresponding to the first difference text and the second data corresponding to the second difference text are acquired. Taking the first data as an example, the word vectors represent the text information of each word of the first single sentence, the position codes represent the relative position of each word within the first single sentence, the position information represents the pixel coordinates, character height and character width of each word in the original image, and the first image represents the image information of the first single sentence. The second data is described analogously. After the first data and the second data are acquired, the difference item discrimination model compares their similarity to judge whether the target difference item is a real difference item. Because the model fuses the text, relative position and image information of the target difference item, its recognition accuracy is improved.
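The shape of this input data can be sketched with dataclasses; the field names are illustrative, not the patent's:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CharInfo:
    word_vector: List[float]                   # text embedding of the character
    position_code: int                         # index of the character within the sentence
    position_info: Tuple[int, int, int, int]   # (x, y, width, height) in the scanned image

@dataclass
class SentenceData:
    chars: List[CharInfo]   # one entry per character of the single sentence
    image: bytes            # cropped image of the sentence region

# first_data and second_data of this shape are fed to the discrimination model.
first_data = SentenceData(
    chars=[CharInfo(word_vector=[0.1, 0.2], position_code=0,
                    position_info=(10, 20, 16, 16))],
    image=b"")
```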
In a possible implementation manner of the first aspect, the difference item discrimination model comprises a first feature extraction network, a second feature extraction network, a linear processing unit and a classifier, the first feature extraction network being identical to the second feature extraction network. Judging whether the target difference item is a real difference item based on the first data and the second data using the difference item discrimination model includes: inputting the first information and the first image into the first feature extraction network to obtain a first feature vector; inputting the second information and the second image into the second feature extraction network to obtain a second feature vector; inputting the first feature vector and the second feature vector into the linear processing unit to obtain a third feature vector; and inputting the third feature vector into the classifier to obtain a classification result, wherein the classification result indicates whether the target difference item is a real difference item.
In this implementation, the difference item discrimination model comprises two identical feature extraction networks (the first and second feature extraction networks), a linear processing unit and a classifier. The first feature extraction network acts on the first data and outputs a first feature vector representing a high-level semantic feature that fuses the text, position codes, position information and image information of the first single sentence; likewise, the second feature vector represents the corresponding high-level semantic feature of the second single sentence. The third feature vector output by the linear processing unit represents the difference between these two high-level semantic features, and the classifier outputs the classification result for the target difference item based on the third feature vector. This discriminates whether the target difference item is a real difference item, effectively filters non-real difference items out of the comparison result, and improves the accuracy of single-sentence comparison.
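The two-branch structure can be sketched with toy stand-ins for each component; the real networks are learned models fusing text, position and image features, whereas here the extractor, linear unit and classifier are fixed illustrative functions:

```python
import math

def extract(vec):
    """Stand-in for the shared feature extraction network: L2-normalizes the
    input; a real network would fuse text, position and image features."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def linear_unit(f1, f2):
    """Element-wise difference of the two high-level feature vectors."""
    return [abs(a - b) for a, b in zip(f1, f2)]

def classify(diff_vec, threshold=0.5):
    """Stand-in classifier: a large feature distance -> real difference item."""
    return sum(diff_vec) > threshold

f1, f2 = extract([1.0, 0.0, 0.0]), extract([0.0, 1.0, 0.0])
print(classify(linear_unit(f1, f2)))  # True: the sentences differ genuinely
print(classify(linear_unit(f1, f1)))  # False: identical features
```

Because both branches share one `extract`, the comparison is symmetric, mirroring the requirement that the two feature extraction networks be identical.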
In a possible implementation manner of the first aspect, the first feature extraction network comprises a first text encoding end, a first image encoding end and a first deep learning model, wherein the first text encoding end is configured to output a first fusion feature according to the first information, the first image encoding end is configured to output a first image feature according to the first image, and the first deep learning model is configured to output the first feature vector according to the first fusion feature and the first image feature. The second feature extraction network comprises a second text encoding end, a second image encoding end and a second deep learning model, wherein the second text encoding end is configured to output a second fusion feature according to the second information, the second image encoding end is configured to output a second image feature according to the second image, and the second deep learning model is configured to output the second feature vector according to the second fusion feature and the second image feature.
Taking the first feature extraction network as an example, the first fusion feature output by the first text encoding end represents a mid-level semantic feature fusing the text of the first single sentence with the relative positions (position codes) and position information of its words; the first image feature output by the first image encoding end represents the image feature of each word of the first single sentence; and the first deep learning model obtains the first feature vector from the first fusion feature and the first image feature, the first feature vector representing a high-level semantic feature fusing the text, position codes, position information and image information of the first single sentence. This design gives the first feature extraction network a strong capability to extract the high-level semantic features of the first single sentence; since the second feature extraction network is identical, it extracts the high-level semantic features of the second single sentence equally well.
In a possible implementation manner of the first aspect, the first information further comprises the word vectors, position codes and position information of N candidate words corresponding to the first difference text, and the first text encoding end sets the correlation between these N candidate words and the words of the first single sentence other than the first difference text to a preset value; the N candidate words corresponding to the first difference text are determined from the candidate word list according to a first preset condition. The second information further comprises the word vectors, position codes and position information of N candidate words corresponding to the second difference text, and the second text encoding end sets the correlation between these N candidate words and the words of the second single sentence other than the second difference text to a preset value; the N candidate words corresponding to the second difference text are determined from the candidate word list according to a second preset condition.
In this implementation, the first information further comprises the text features (word vectors) of the candidate words corresponding to the first difference text and their relative position features (position codes and position information) within the first single sentence. Setting the correlation between these candidate words and the words of the first single sentence other than the first difference text to a preset value means that the candidate words influence the first difference text far more than they influence the rest of the sentence. The candidate word features introduced in the first information can therefore correct the first difference text, making the subsequently extracted first feature vector more accurate and bringing the recognized first single sentence closer to the first single sentence in the real document, which improves its fidelity.
In a possible implementation manner of the first aspect, in a case that the first probability is smaller than the first threshold value and the second probability is greater than or equal to the first threshold value, the method further includes: determining candidate words corresponding to the first difference text from the candidate word list according to a third preset condition; when the number of the candidate words corresponding to the first difference text is 1, taking the candidate words corresponding to the first difference text as corrected first difference text; when the second difference text is the same as the corrected first difference text, determining that the target difference item is not a real difference item; and when the second difference text is different from the corrected first difference text, determining that the target difference item is a real difference item.
In this implementation, when the number of candidate words corresponding to the first difference text is 1, that candidate word is taken as the corrected first difference text, i.e., the corrected first difference text is considered correctly recognized. Since the second difference text is also recognized correctly (the second probability being greater than or equal to the first threshold), whether the target difference item is a real difference item can be determined by comparing the second difference text with the corrected first difference text, thereby judging the target difference item accurately and improving the accuracy of the single-sentence comparison result.
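This correction-and-compare branch can be sketched as follows; the selection rule (candidates scoring above a threshold) and the values are illustrative stand-ins for the third preset condition:

```python
def judge_with_correction(diff_a, diff_b, candidates_a, threshold=0.5):
    """diff_a is suspected misrecognized while diff_b is trusted: correct
    diff_a from its candidate list, then compare with diff_b.
    Returns True (real difference), False (not real), or None (undecided)."""
    selected = [w for w, score in candidates_a.items() if score > threshold]
    if len(selected) == 1:
        corrected_a = selected[0]
        return corrected_a != diff_b
    return None  # not exactly one candidate: fall back to the discrimination model

# "已" was a misread of "乙"; after correction both sides agree, so this
# difference item is filtered out as not real.
print(judge_with_correction("已", "乙", {"乙": 0.92, "对": 0.05}))  # False
```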
In a possible implementation manner of the first aspect, in a case that the first probability is smaller than the first threshold value and the second probability is smaller than the first threshold value, the method further includes: determining candidate words corresponding to the first difference text from the candidate word list according to a third preset condition; determining candidate words corresponding to the second difference text from the candidate word list according to a fourth preset condition; when the number of candidate words corresponding to the first difference text is 1 and the number of candidate words corresponding to the second difference text is 1, using the candidate words corresponding to the first difference text as corrected first difference text and using the candidate words corresponding to the second difference text as corrected second difference text; when the corrected first difference text is the same as the corrected second difference text, determining that the target difference item is not a real difference item; and when the corrected first difference text is different from the corrected second difference text, determining that the target difference item is a real difference item.
In this implementation, when both the first probability and the second probability are smaller than the first threshold, the first difference text and the second difference text are corrected respectively, and whether the target difference item is a real difference item is determined by whether the corrected first difference text is the same as the corrected second difference text, thereby judging the target difference item accurately and improving the accuracy of the single-sentence comparison result.
In a possible implementation manner of the first aspect, the third preset condition is that a score of a candidate word in the candidate word list is greater than a second threshold, and the score of the candidate word is determined by a probability corresponding to the candidate word and a font similarity between the candidate word and the first difference text; the fourth preset condition is that the score of the candidate word in the candidate word list is larger than a second threshold value, and the score of the candidate word is determined by the probability corresponding to the candidate word and the font similarity between the candidate word and the second difference text.
In this implementation, taking the candidate words corresponding to the first difference text as an example: the higher the font (glyph) similarity between a candidate word in the candidate word list and the first difference text, and the higher that candidate word's probability, the more likely it is to be selected as a candidate word corresponding to the first difference text; the selected candidate word is thus the word most likely to replace the first difference text. This helps improve the recognition accuracy of the difference item discrimination model.
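A sketch of this scoring rule, assuming an equal-weight combination of model probability and glyph similarity and a toy similarity function over visually confusable pairs; both the weighting and the similarity table are hypothetical:

```python
def score_candidates(candidates, diff_text, glyph_sim, alpha=0.5, second_threshold=0.4):
    """Score = weighted mix of model probability and glyph (font-shape)
    similarity to the recognized difference text; keep words above the
    second threshold. The weighting scheme is illustrative."""
    scores = {w: alpha * p + (1 - alpha) * glyph_sim(w, diff_text)
              for w, p in candidates.items()}
    return [w for w, s in scores.items() if s > second_threshold]

# Hypothetical glyph similarity: 1.0 for confusable pairs or identity, else 0.0.
confusable = {("乙", "已"), ("已", "乙")}
sim = lambda a, b: 1.0 if (a, b) in confusable or a == b else 0.0

print(score_candidates({"乙": 0.6, "对": 0.3}, "已", sim))  # ['乙']
```

"乙" wins both on probability and on looking like the recognized "已", so it is the candidate most likely to replace the first difference text.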
In a possible implementation manner of the first aspect, the first single sentence and the second single sentence come from two different, associated documents.
In this implementation, the first single sentence and the second single sentence come from different, associated documents; that is, the discrimination of the target difference item serves document comparison and improves the accuracy of the document comparison result.
In a second aspect, the present application provides an apparatus comprising: an acquisition unit, configured to acquire a target difference item between a recognition result of a first single sentence and a recognition result of a second single sentence, wherein the target difference item comprises a first difference text and a second difference text, the first single sentence comprises a common item and the first difference text, the second single sentence comprises the common item and the second difference text, and the length of the first difference text is the same as that of the second difference text; a prediction unit, configured to determine a first probability corresponding to the first difference text and a second probability corresponding to the second difference text based on a language prediction model and the common item; and a discrimination unit, configured to judge whether the target difference item is a real difference item according to the first probability and the second probability.
In a third aspect, the present application provides a computing device comprising a processor and a memory, the processor and memory being connected or coupled together by a bus; wherein the memory is used for storing program instructions; the processor invokes program instructions in the memory to perform the method of the first aspect or any of the possible implementations of the first aspect.
In a fourth aspect, the application provides a computer readable storage medium storing program code for execution by an apparatus, the program code comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising program instructions which, when executed by a computing apparatus, perform the method of the first aspect or any of its possible implementations. The computer program product may be a software installation package which can be downloaded and executed on a computing device to implement the method of the first aspect or any of its possible implementations.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a difference term discrimination based on a language prediction model according to an embodiment of the present application;
FIG. 3A is a block diagram of a BERT-based language prediction model according to an embodiment of the present application;
FIG. 3B is a schematic illustration of the specific structure of a Transformer encoder;
FIG. 4 is a block diagram of a difference term discrimination model according to an embodiment of the present application;
FIG. 5 is a block diagram of a first feature extraction network according to an embodiment of the application;
FIG. 6 is a schematic diagram of input data of a first feature extraction network according to an embodiment of the application;
FIG. 7 is a flowchart of a method for determining a difference term according to an embodiment of the present application;
FIG. 8 is a flowchart of another method for determining a difference term according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computing device according to an embodiment of the present application;
Fig. 10 is a schematic functional structure of a computing device according to an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The terms first, second and the like in the description and in the claims, are used for distinguishing between different objects and not for describing a particular sequential order.
For ease of understanding, related terms and the like that may be involved in embodiments of the present application are described first.
(1)OCR
OCR refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of light and dark, and then translates those shapes into computer text using a character recognition method. In short, it is the process of analyzing the image file obtained by scanning text material to extract its text and layout information. OCR saves the manpower and time required to enter text data from a keyboard.
The conventional OCR pipeline is generally divided into several steps: image acquisition, preprocessing, line segmentation and character recognition. Image acquisition refers to obtaining the scanned image corresponding to a paper document. Preprocessing includes binarization, image enhancement, noise removal, image filtering, tilt correction and the like; it optimizes the image quality of the scanned image so as to improve the accuracy of subsequent character recognition. Line segmentation refers to detecting single characters or runs of characters, and character recognition refers to feeding the segmented character images into a recognition model to obtain the character information in the original image.
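As a toy illustration of one preprocessing step, global-threshold binarization can be sketched on a grayscale image represented as nested lists; real systems operate on image arrays and often use adaptive thresholds (e.g., Otsu's method):

```python
def binarize(gray, threshold=128):
    """Global-threshold binarization: pixels darker than the threshold
    become ink (1), the rest become background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

# A tiny 2x3 grayscale patch: a dark vertical stroke on a light background.
page = [[250, 30, 240],
        [245, 25, 235]]
print(binarize(page))  # [[0, 1, 0], [0, 1, 0]]
```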
Of course, OCR can also be implemented based on deep learning, mainly in two ways. One is to complete the detection and recognition of characters in one pass through an end-to-end model; for example, STN-OCR uses a single deep neural network to detect and recognize characters from images in a semi-supervised learning manner. For another example, FOTS (Fast Oriented Text Spotting) is a rapid end-to-end character detection and recognition framework that reduces the time required for feature extraction by sharing training features and complementary supervision, thereby improving character detection and recognition efficiency. The other way is divided into two stages, character detection and character recognition, where character detection refers to locating the character areas in an image, for example with the CTPN (Connectionist Text Proposal Network) algorithm, the EAST (Efficient and Accurate Scene Text detection pipeline) algorithm and the like, and character recognition refers to recognizing the located characters, for example with CNN+softmax and the like.
(2) BERT model
The bidirectional Transformer encoder representation (Bidirectional Encoder Representations from Transformers, BERT) is a language representation model in which the representation of each word contains information from all the words before and after it. BERT is a method of pre-training language representations: a generic "language understanding" model is typically trained over a large corpus of text (e.g., Wikipedia), and the model can then be used to perform relevant natural language processing (Natural Language Processing, NLP) tasks.
The BERT model may be used to make language predictions. For example, given a sentence in which one or more words are occluded, the occluded words must be predicted from the remaining words in the sentence. The BERT model may be trained by masked language modeling (Masked Language Modeling, MLM): when a sentence is input, words to be predicted are randomly selected, of which 80% are replaced by a mask (a special symbol), 10% are replaced by a random word, and 10% are kept as the original word. Although the BERT model can see the input information at all positions, it cannot know in advance which words occupy those positions, because the words to be predicted have been replaced by the special symbol; the BERT model is thus forced to learn to predict the words to be filled in at those positions from the context information. By adjusting the parameters of the BERT model so that the probability of a correct prediction is as large as possible, the training of the BERT model is realized.
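The 80%/10%/10% corruption scheme above can be sketched as follows. This is an illustrative sketch of the MLM data preparation step, not BERT's reference implementation; the function and token names are assumptions:

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, rng=None):
    """Apply the MLM corruption described above to a token list.

    Each randomly selected token is replaced by "[MASK]" 80% of the
    time, by a random vocabulary word 10% of the time, and kept as the
    original word 10% of the time. Returns the corrupted tokens and a
    dict mapping each selected position to the word the model must
    predict there.
    """
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok                # the word to be predicted
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"     # 80%: special symbol
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random word
            # else: 10%: keep the original word
    return corrupted, targets
```

During training, the loss is computed only at the positions recorded in `targets`, which is what forces the model to reconstruct the occluded words from context.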
The BERT model contains a multi-layer Transformer structure, where the Transformer is an attention-based (Attention) network structure. The core idea of attention is to calculate the correlations between each word in a sentence and all the words in that sentence; these correlations reflect, to some extent, the relevance and importance of the different words in the sentence. New expressions for each word can then be obtained by using these correlations to adjust the importance (weight) of each word. The new expression contains not only the characteristics of the word itself but also the relations between the other words and that word, so it is a more global expression than a simple word vector. The Transformer structure creates the final text representation by repeatedly stacking such attention layers and ordinary non-linear layers over the input text. Thus, the self-attention mechanism learns the dependency relationships between the words inside a sentence, capturing the internal structure of the sentence.
(3) LSTM model
Long Short-Term Memory (LSTM) is a type of recurrent neural network (Recurrent Neural Network, RNN) characterized by a time-loop structure, which can well characterize sequence data with temporal-spatial correlation, including time-series data (e.g., text, traffic flow, air temperature, etc.). However, an ordinary RNN uses its hidden layer as the memory module of the model and connects it directly with the other parts of the network; after the model is unrolled, the excessive number of layers causes the gradient vanishing problem, and effective historical information cannot be stored for a long time under the influence of continuously arriving input data. An RNN is therefore good at short-term memory but not at long-term memory. To give the RNN long-term memory, the LSTM was constructed, mainly to solve the gradient vanishing and gradient explosion problems in long-sequence training. In the field of natural language processing, LSTM is typically used to extract the semantic and grammatical information of text and then works with downstream models to perform specific tasks such as classification, sequence labeling, text matching, etc.
An LSTM cell is composed of a memory cell and a plurality of regulating gates (gates). LSTM uses the state (state) of the memory cell (memory cell) to hold history information. The operations of updating the state with input data and of outputting state information are regulated by two gates respectively, the input gate (input gate) and the output gate (output gate). When the input gate is closed, the history information is not disturbed by new input data and is kept as it is (constant error carrousel); similarly, only when the output gate is opened is the history information in the memory cell exposed to the rest of the network.
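One step of the gating mechanism above can be sketched with scalars. This is an illustrative sketch, not the patent's model; note that it also includes a forget gate, which is standard in modern LSTMs although the description above mentions only the input and output gates of the original formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell, following the gate description above.

    `w` maps a gate name ("i", "f", "o", "g") to a (w_x, w_h, b) triple.
    All quantities are scalars to keep the sketch readable; real LSTMs
    use vectors and weight matrices.
    """
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate state
    c = f * c_prev + i * g   # memory cell: history survives unless gated out
    h = o * math.tanh(c)     # output gate controls what the cell exposes
    return h, c
```

When the input gate `i` is near 0, the new candidate `g` barely affects the memory state `c`, which is exactly the "history information is not disturbed by new input data" behavior described above.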
In the related art, two documents to be compared are generally recognized by OCR technology, and the differing contents are then determined based on the recognition results of the two documents. When such differing contents share a common text portion in the corresponding single sentences of the two recognition results, the differing contents may be referred to as a difference item. For example, for document 1 and document 2, after OCR recognition there is "interest rate marketization or market paralysis" in the recognition result of document 1 and "interest rate marketization or market dysentery" in the recognition result of document 2; a set of difference items is then determined from the recognition results to be "paralysis" and "dysentery", where "paralysis" is derived from the recognition result of document 1, "dysentery" is derived from the recognition result of document 2, and "interest rate marketization or market" is the common text portion corresponding to the difference items.
It is easy to understand that when the OCR recognition is correct, that is, the original sentence in document 1 is "interest rate marketization or market paralysis" and the original sentence in document 2 is "interest rate marketization or market dysentery", both "paralysis" and "dysentery" are recognized correctly, so "paralysis" and "dysentery" form a set of real difference items. However, consider the case where an OCR recognition error occurs: the original sentence in both document 1 and document 2 is "interest rate marketization or market paralysis". In the OCR recognition of document 1, the word "paralysis" in that sentence is recognized accurately. In document 2, however, a watermark, a signature or the like happens to cover the word "paralysis" in the sentence, producing noise that interferes with recognition, so that "paralysis" in document 2 is misrecognized as "dysentery". In this case, "paralysis" and "dysentery" are not a set of real difference items; but because it is not known that a recognition error has occurred, "paralysis" and "dysentery" are still reported as difference items, and the accuracy of the document comparison result is therefore low.
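The extraction of such candidate difference items from two recognition results can be sketched with a standard sequence alignment. The following uses Python's `difflib.SequenceMatcher` as a stand-in for the comparison algorithm the text describes, at word granularity; the English sentences are illustrative analogues of the example above, not the patent's data:

```python
import difflib

def find_difference_items(rec1, rec2):
    """Locate candidate difference items between two OCR recognition results.

    Compares the two texts as word sequences and returns one
    (text_from_doc1, text_from_doc2) pair per differing span; the
    "equal" spans are the common text portion.
    """
    w1, w2 = rec1.split(), rec2.split()
    items = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, w1, w2).get_opcodes():
        if op != "equal":
            items.append((" ".join(w1[i1:i2]), " ".join(w2[j1:j2])))
    return items

print(find_difference_items(
    "interest rate marketization or market paralysis",
    "interest rate marketization or market dysentery"))
# [('paralysis', 'dysentery')]
```

Every pair returned here is only a candidate: as the paragraph above explains, it may be a real difference or an artifact of an OCR recognition error, which is precisely what the discrimination method of the application decides.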
The application provides a difference item judging method, which can effectively determine real difference items and realize the filtration of non-real difference items caused by OCR recognition errors, thereby improving the accuracy of document comparison.
Referring to fig. 1, a schematic diagram of a system architecture 100 is provided in an embodiment of the present application, where a data acquisition device 160 is configured to acquire training data, where the training data includes a corpus for training a language prediction model, and the training data further includes sample data for training positive difference term samples and negative difference term samples of a difference term discrimination model, where the positive difference term samples represent real difference terms, and the negative difference term samples represent non-real difference terms. Alternatively, the data obtaining device 160 may directly obtain a marked positive difference item sample and a marked negative difference item sample, or may mark the documents to be compared by itself to obtain a positive difference item sample and a negative difference item sample, for example, the data obtaining device 160 performs OCR recognition on at least one group of documents to be compared to obtain a plurality of candidate difference items, and then marks the plurality of candidate difference items based on real data corresponding to the difference items in the at least one group of documents to be compared to obtain a positive difference item sample and a negative difference item sample. Further, the data obtaining device 160 is further configured to obtain sample data of a positive difference item sample, where the sample data of the positive difference item sample includes text and position information of a sentence in which the positive difference item sample is located and an image of the sentence in which the positive difference item sample is located. 
Similarly, the data obtaining device 160 also obtains sample data of a negative difference item sample, where the sample data of the negative difference item sample includes text and position information of a sentence in which the negative difference item sample is located and an image of the sentence in which the negative difference item sample is located.
After the training data is collected, the data obtaining device 160 may store the training data in the database 130, and the training device 120 trains the initial language prediction model and the initial difference term discrimination model based on the training data maintained in the database 130, to finally obtain the target language prediction model 101 and the target difference term discrimination model 102. In some possible embodiments, the data acquisition device 160 may also send training data directly to the training device 120 to cause the training device 120 to train the initial language prediction model based on a corpus in the training data to obtain the target language prediction model 101, and train the initial difference term discrimination model based on sample data of positive difference term samples and sample data of negative difference term samples in the training data to obtain the target difference term discrimination model 102.
The process by which the training device 120 obtains the target language prediction model 101 based on the training data is described below. The input data of the training device 120 are a plurality of sentences in a corpus, each sentence being a correct expression conforming to grammatical norms and having a length smaller than a preset threshold. During training, the training device 120 randomly masks (masks) some words in an input sentence; the masked words may be called the original text. The blank parts of the sentence are predicted through the context information to obtain a predicted text, and the predicted text is then compared with the original text, until the difference between the predicted text output by the training device 120 and the original text is smaller than a preset threshold, at which point the output predicted text can be considered a substitute for the original text, thereby completing the training of the target language prediction model 101. The specific training process of the target language prediction model 101 will be described in detail later.
The process by which the training device 120 obtains the target difference term discrimination model 102 based on the training data is described below. The input data of the training device 120 are the sample data of the positive difference term samples and of the negative difference term samples, and the training device 120 obtains two vectors corresponding to each set of difference terms based on the input data. Taking one set of difference terms as an example, where the set includes a text A recognized from document 1 and a text B recognized from document 2, the two vectors are the fusion feature vector corresponding to text A and the fusion feature vector corresponding to text B; the manner of obtaining the two vectors corresponding to each set of difference terms is described in detail below and is not repeated here. The training device 120 classifies each set of difference terms based on its two corresponding vectors, that is, determines whether the set of difference terms is a real difference term, and trains the initial difference term discrimination model based on the classification result corresponding to each difference term and the labeling information corresponding to that difference term (that is, positive difference term sample or negative difference term sample), until the classification results output by the initial difference term discrimination model are consistent with the labeling information corresponding to the input information, thereby completing the training of the target difference term discrimination model 102. The specific training process of the target difference term discrimination model 102 is described in detail below.
The target language prediction model 101 and the target difference term discrimination model 102 described above can be used to implement the difference term discrimination method provided by the embodiment of the present application. Specifically, the two documents to be compared input by the user terminal 140 are processed by the processing module 113 to obtain a target difference item; the single sentences corresponding to the target difference item are extracted and input into the target language prediction model 101; whether the target difference item is a real difference item is judged according to the prediction result, and the judgment result is output to the user terminal 140 through the I/O interface 112. For a target difference item that the target language prediction model 101 cannot judge, the processing module 113 performs further processing to obtain the candidate words corresponding to the target difference item, and obtains the text and position information of the single sentences corresponding to the target difference item after the candidate words are merged in, together with the images of those single sentences. These are input into the target difference term discrimination model 102, which outputs a verification result for the target difference item, and the verification result is output to the user terminal 140 through the I/O interface 112. For the specific difference term discrimination method, reference may be made to the related description below, which is not repeated here.
Optionally, the target language prediction model 101 in the embodiment of the present application may be specifically a BERT-based language prediction model, and the target difference term discrimination model 102 may be a deep learning model fused with text, position and image information. It should be noted that, in practical applications, the training data maintained in the database 130 is not necessarily all acquired by the data acquisition device 160, but may be acquired from other devices. It should be noted that, the training device 120 does not need to train the target language prediction model 101 and the target difference term discrimination model 102 based on the training data maintained by the database 130, and may acquire the training data from the cloud or other places for model training, which should not be taken as a limitation of the embodiments of the present application. The training device 120 may exist independently of the execution device 110 or may be integrated within the execution device 110.
The target language prediction model 101 and the target difference term discrimination model 102 obtained by training according to the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, or a server or a cloud terminal. In fig. 1, an execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a user terminal 140, where the input data may include in an embodiment of the present application: scanned parts of documents to be compared.
The processing module 113 is configured to perform preprocessing according to input data (e.g., a scanned document of a document to be compared) received by the I/O interface 112 to obtain a target difference item and extract a single sentence corresponding to the target difference item, and the processing module 113 is further configured to perform subsequent processing according to an output of the computing module 111.
In the processing of the input data by the execution device 110, or in the processing related to the execution of the computation by the computation module 111 of the execution device 110, the execution device 110 may call the data, the code, etc. in the data storage system 150 for the corresponding processing, or may store the data, the instruction, etc. obtained by the corresponding processing in the data storage system 150.
It should be noted that, the training device 120 may generate, for different targets, the corresponding target language prediction model 101 and the target difference term discrimination model 102 based on different training data, where the two models may be used to achieve the targets, so as to provide the user with a desired result, for example, may provide the user with a real difference term in the document to be compared in the present application.
The difference term discrimination method provided by the application is described in detail below. In this method, a difference item is predicted through a language prediction model, whether the difference item was recognized correctly is judged based on the prediction probabilities corresponding to the difference item, and, in the case where recognition is judged to be correct, whether the difference item is a real difference item is determined; in the case where it cannot be determined whether the difference item was recognized correctly, the relevant information of the difference item is further extracted and input into a difference term discrimination model to determine whether the difference item is a real difference item. The target language prediction model 101 mentioned above is the language prediction model described below, and the target difference term discrimination model 102 is the difference term discrimination model described below.
First, the method for discriminating the difference term based on the language prediction model is described in detail.
Specifically, at least one set of target difference items and the position information corresponding to each set of target difference items are obtained according to the recognition results of two related documents. Taking one set of target difference items as an example: each set of target difference items includes a first difference text and a second difference text, and the corresponding position information includes first position information corresponding to the first difference text and second position information corresponding to the second difference text. A first single sentence in which the first difference text is located is extracted based on the first position information, and a second single sentence in which the second difference text is located is extracted based on the second position information. A target sentence corresponding to the target difference item is obtained from the first single sentence and the second single sentence; the target sentence includes a common item and an occlusion item. The target sentence is input into a language prediction model, which predicts the occlusion item to obtain a candidate word list including a plurality of prediction results corresponding to the occlusion item and the probabilities corresponding to those prediction results. A first probability corresponding to the first difference text and a second probability corresponding to the second difference text are then looked up in the candidate word list, and when the first probability and the second probability satisfy a preset condition, the target difference item is determined to be a real difference item.
Wherein the first and second differential texts have the same length, in other words, the first and second differential texts have the same number of words. The first difference text or the second difference text comprises at least one word.
In some possible embodiments, the first difference text includes a semantically irrelevant word, which may be a modal particle, a tense particle, a structural particle or the like (e.g., any one of "o", "ya", etc.), causing the length of the first difference text to differ from the length of the second difference text. In this case, the semantically irrelevant words in the first difference text are removed so that the length of the first difference text is the same as the length of the second difference text.
In some possible embodiments, if no semantically irrelevant word exists in either the first difference text or the second difference text, and the length of the first difference text still differs from the length of the second difference text, the target difference item may in this case be directly determined to be a real difference item.
In one embodiment, the target difference items are obtained as follows: the two related documents are scanned to obtain two scanned files, OCR (optical character recognition) is performed on the scanned files to obtain text contents, paragraph segmentation is performed on the text contents, and the paragraphs of the two scanned files are compared using a paragraph-based sequence alignment algorithm to obtain the target difference items and their position information, where the position information of a target difference item includes its pixel coordinates in the scanned file.
Obtaining the target sentence corresponding to the target difference item from the first single sentence and the second single sentence is specifically as follows. The first single sentence is the sentence in which the first difference text is located, and the second single sentence is the sentence in which the second difference text is located; it can be understood that the parts of the two single sentences other than the target difference item are identical. The identical text parts of the first single sentence and the second single sentence are therefore extracted as the common item of the target sentence, and the occlusion item in the target sentence represents the differing text parts of the two single sentences. The occlusion item includes at least one mask, the number of masks in the occlusion item is the same as the number of words in the first difference text (or in the second difference text), and the relative position of each word of the common item in the target sentence is unchanged. It should be noted that the first single sentence and the second single sentence are both extracted from the recognition result of the corresponding document based on the corresponding position information.
For example, assume the two documents are a first document and a second document, and a set of target difference items is obtained from their recognition results, where "paralysis" is the first difference text and "dysentery" is the second difference text. The first single sentence corresponding to the first difference text, "interest rate marketization or market paralysis", is extracted from the recognition result of the first document based on the position information of "paralysis", and the second single sentence corresponding to the second difference text, "interest rate marketization or market dysentery", is extracted from the recognition result of the second document based on the position information of "dysentery". A masking operation is performed on the target difference item to obtain the target sentence "interest rate marketization or market [mask][mask]" corresponding to the target difference item, where "interest rate marketization or market" is the common item, the relative position of each of its words in the target sentence is unchanged, and "[mask][mask]" is the occlusion item.
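The construction of such a masked target sentence can be sketched as below. This is an illustrative sketch operating on words (one mask per differing word); the patent's example masks per character of the original Chinese, and the English sentences are illustrative analogues:

```python
def build_target_sentence(sent1, sent2, mask="[mask]"):
    """Build the target sentence: common item plus occlusion item.

    Finds the common prefix and suffix of the two single sentences (the
    common item) and replaces the differing span with one mask token per
    word of the difference text. Assumes, as the text states, that the
    two difference texts have the same length.
    """
    a, b = sent1.split(), sent2.split()
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1                                    # common prefix length
    s = 0
    while s < min(len(a), len(b)) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1                                    # common suffix length
    masks = [mask] * (len(a) - p - s)             # occlusion item
    return " ".join(a[:p] + masks + a[len(a) - s:])

print(build_target_sentence(
    "interest rate marketization or market paralysis",
    "interest rate marketization or market dysentery"))
# interest rate marketization or market [mask]
```

The resulting sentence is exactly what is fed to the language prediction model in the next step: the common item anchors the context, and the masks mark the positions to be predicted.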
In one implementation, after the target sentence corresponding to the target difference item is obtained, the target sentence is input into a language prediction model. Taking a BERT-based language prediction model as an example, it predicts the occlusion item by using the context information of the occlusion item in the target sentence to obtain a candidate word list, which lists a plurality of prediction results corresponding to the occlusion item and the probabilities corresponding to those prediction results, where the probability corresponding to a prediction result represents the probability of that prediction result appearing at the mask positions in the target sentence. Referring to fig. 2, fig. 2 is a schematic diagram of a BERT-based language prediction model. As shown in fig. 2, the target sentence "interest rate marketization or market [mask][mask]" is input into the BERT-based language prediction model, which predicts the two masks in the target sentence according to the interrelationships between the words in "interest rate marketization or market", so as to output a candidate word list including a plurality of prediction results; the prediction results in the candidate word list may be arranged from high to low according to their probabilities. Thus, for this target sentence, the words appearing at the "[mask][mask]" positions may be "paralysis", "behavior", "mechanism", "criterion" or the like.
The structure of the BERT-based language prediction model may specifically refer to fig. 3A. As shown in fig. 3A, the BERT-based language prediction model is composed of BERT and a classifier, where BERT includes a plurality of Transformer Encoders stacked layer by layer. First a sentence or sentence pair is input; taking the target sentence "interest rate marketization or market [mask][mask]" as an example, the word vectors and position encodings of the target sentence are first obtained, where the position encodings represent the order of the words in the sentence and can express the distances between words well. The word vectors and position encodings of the target sentence are input into BERT, which, through its multiple internal Transformer Encoders, obtains the semantic enhancement vector of each word in the target sentence based on those word vectors and position encodings and outputs them to the classifier. Finally, the classifier performs logistic-regression classification on the output of BERT to output the prediction results, for example candidate words and the probabilities corresponding to the candidate words. It should be noted that the classifier may be a fully connected layer in which each neuron is fully connected with the output of BERT, so as to integrate the class-discriminative local information in BERT; the prediction results are output after logistic-regression classification of the obtained data.
Taking one Transformer Encoder as an example, its specific structure is shown in fig. 3B. The Transformer Encoder includes Multi-head Self-Attention, layer normalization (Layer Normalization) and linear transformation (Linear Transformation). Multi-head Self-Attention can be understood as fusing, under various semantic scenarios, the semantic vector of the target word with those of the other words in the text in different ways. The input and output of Multi-head Self-Attention are identical in form: the input is the original vector of each word in the text, and the output is the semantic enhancement vector of each word after the full-text semantic information has been fused in. In addition, a residual connection superimposes the input of the module on its output to make training of the model easier. Layer normalization is similar to batch normalization and normalizes along the word-embedding dimension. The linear transformation performs two linear transformations on the layer-normalized vector to obtain the semantic enhancement vector, so as to enhance the expressive capacity of the whole model.
Multi-head Self-Attention is a cascade of multiple Self-Attention modules, which are described below. Self-Attention mainly involves three concepts: the query vector (Query, Q), the key vector (Key, K) and the value vector (Value, V). Self-Attention takes the target word as Q and each word of its context as K, uses the similarity of Q to each K as a weight, and merges the V of each context word into the original V of the target word; Q, K and V all come from the same input text. It can be simply understood that Q is a feature of the target word, K is a feature of the context information, V is the content of the context information, and K corresponds one-to-one with V. Specifically, the semantic vector representations of the target word and of each context word are taken as input. First, the Q vector representation of the target word, the K vector representation of each context word, and the original V vector representations of the target word and of each context word are obtained through linear transformation; then the similarity of the Q vector to each K vector is calculated as a weight, and the V vectors of the target word and of each context word are weighted and fused as the output of the attention, namely the semantic enhancement vector of the target word.
It can be seen that the computation of the Attention is mainly divided into three steps, wherein the first step is to calculate the similarity between Q and each K to obtain the weight, and the common similarity functions include dot product, splicing, perceptron and the like; the second step is typically to normalize the weights using a softmax function; and finally, carrying out weighted summation on the weight and the corresponding V to obtain the final attention.
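The three steps above (similarity, softmax normalization, weighted summation) can be sketched for a single query as follows. This is an illustrative sketch with plain Python lists, using the dot product as the similarity function; it assumes the query and key vectors share the same dimension:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, keys, values):
    """Scaled dot-product attention for one query vector.

    Step 1: dot-product similarity of Q with each K (scaled by sqrt(d_k));
    step 2: softmax normalization of the similarities into weights;
    step 3: weighted summation of the V vectors.
    Returns the attended output vector and the attention weights.
    """
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)         # similarity -> normalized weights
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights
```

In the full Self-Attention layer, this computation is performed with every word of the sentence in turn acting as the query, which yields the semantic enhancement vector of each word.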
For example, for the target sentence "interest rate marketization or market [mask] [mask]", the specific Self-Attention procedure is: take each word in the target sentence in turn as the target word and calculate its interrelationship with all the words in the context, thereby obtaining the semantic enhancement vector of each word in the sentence.
The principle of Self-Attention is given by equation (1):

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\qquad(1)$$

Wherein $QK^{T}$ denotes the similarity between each Q vector (feature of the target word) and each K vector (feature of the context information), $d_k$ is the dimension of the K vectors, $\mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)$ represents the distribution (i.e., the weights) of the similarity of each target word to each piece of context information, and $\mathrm{Attention}(Q,K,V)$ represents the weighted summation of the value vectors, thereby obtaining the semantic enhancement vector of each target word.
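As an illustration only, equation (1) can be sketched in NumPy. The projection matrices `Wq`, `Wk`, `Wv` and the toy dimensions below are assumptions for the example, not part of the embodiment:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Equation (1): each word's Q is matched against every K; the resulting
    weights fuse the V vectors into one semantic enhancement vector per word."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # linear transformations of the input
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # similarity of each Q with every K
    return weights @ V                         # weighted summation of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 words, 8-dim original vectors
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)            # one enhancement vector per word
```

The output has the same number of rows as the input, matching the statement that input and output are identical in form.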
It should be noted that the language prediction model is trained in advance, and the training corpus is targeted. For example, if the language prediction model is used to predict words in sentences of a contract, the training corpus consists of various contract documents; as another example, if the language prediction model is used to predict words in sentences of a markup document, the training corpus consists of various markup documents. The embodiment of the present application does not particularly limit the corpus used for training the language prediction model.
Further, after the candidate word list is obtained, since the target difference item was obtained in advance, the first difference text and the second difference text are known. The first probability corresponding to the first difference text and the second probability corresponding to the second difference text are searched for in the candidate word list, and each is compared with a first threshold to determine whether the target difference item is a real difference item, wherein the first threshold is set based on experience.
Searching for the first probability corresponding to the first difference text in the candidate word list specifically comprises: matching the first difference text against each candidate word in the candidate word list. When the first difference text is identical to a certain prediction result in the candidate word list, the first difference text exists in the candidate word list and the matching succeeds; the first probability is then the probability corresponding to the successfully matched prediction result. When the first difference text differs from every prediction result in the candidate word list, the first difference text does not exist in the candidate word list and the matching fails; the first probability is then set to a preset value, for example, 0, 0.01, or another value. It should be noted that the first difference text may be absent from the candidate word list when, for example, it is a rarely used word, a symbol, a pictograph, an emoticon, a traditional Chinese character, or the like.
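The lookup with its fallback can be sketched as follows; the probability values attached to the four prediction results of fig. 2 are hypothetical, chosen only so the example runs:

```python
def lookup_probability(diff_text, candidate_probs, default=0.0):
    """Match the difference text against the candidate word list; on a
    failed match, fall back to a preset value (0 here)."""
    return candidate_probs.get(diff_text, default)

# Hypothetical probabilities for the four prediction results of fig. 2.
candidate_probs = {"paralysis": 0.62, "behavior": 0.21,
                   "mechanism": 0.10, "criterion": 0.07}
p_hit = lookup_probability("paralysis", candidate_probs)   # successful match
p_miss = lookup_probability("extensive", candidate_probs)  # failed match -> preset value
```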
For example, taking the candidate word list of fig. 2 as an example, assuming the first difference text is "paralysis", "paralysis" is matched in turn against the four prediction results in the candidate word list ("paralysis", "behavior", "mechanism", and "criterion"); the first difference text matches "paralysis" successfully, so the first probability is the probability corresponding to "paralysis" in the candidate word list. Assuming instead that the first difference text is "extensive", "extensive" is matched in turn against the same four prediction results and is found to differ from each of them, so the matching fails and the first probability corresponding to the first difference text "extensive" may be set to 0.
The first probability and the second probability are described as follows. When the first difference text (or the second difference text) is a single word, each prediction result in the candidate word list is also a single word, so the first probability (or the second probability) is simply the probability corresponding to that text. When the first difference text (or the second difference text) contains a plurality of words, each prediction result in the candidate word list is likewise composed of a plurality of words, and each word in each prediction result has a corresponding probability. In this case the first probability is obtained based on the probabilities of the respective words of the first difference text in the candidate word list, and similarly the second probability is obtained based on the probabilities of the respective words of the second difference text. For example, the first probability may be obtained by a weighted summation of the probabilities of the respective words of the first difference text; as another example, the first probability may be the smallest of those probabilities. The second probability may be obtained in the same manner and is not described again here.
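The two aggregation options just mentioned (minimum, or weighted summation with uniform weights as an assumed default) can be sketched as:

```python
def aggregate_probability(char_probs, mode="min", weights=None):
    """Combine the per-word probabilities of a multi-word difference text:
    either the smallest per-word probability, or a weighted summation."""
    if mode == "min":
        return min(char_probs)
    if weights is None:                       # uniform weights by default
        weights = [1.0 / len(char_probs)] * len(char_probs)
    return sum(p * w for p, w in zip(char_probs, weights))

p_min = aggregate_probability([0.9, 0.4])                   # -> 0.4
p_avg = aggregate_probability([0.9, 0.4], mode="weighted")  # -> 0.65
```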
In one implementation, when the first probability is greater than or equal to the first threshold and the second probability is greater than or equal to the first threshold (i.e., the first probability and the second probability satisfy the preset condition), both the first difference text and the second difference text are recognized correctly, and the target difference item composed of the first difference text and the second difference text is determined to be a real difference item. If the first probability is smaller than the first threshold, it cannot be determined whether the first difference text is recognized correctly; similarly, if the second probability is smaller than the first threshold, it cannot be determined whether the second difference text is recognized correctly. In either case, it cannot be determined whether the target difference item is a real difference item.
In some possible embodiments, when the first probability is smaller than the first threshold or the second probability is smaller than the first threshold, it accordingly cannot be determined whether the first difference text and/or the second difference text is recognized correctly. In this case, error correction processing is performed on the first difference text and/or the second difference text based on the candidate word list, and it is then determined whether the corrected target difference item is a real difference item.
The error correction process is specifically as follows, taking the first difference text as an example (i.e., the case where it cannot be determined whether the first difference text is recognized correctly). First, the glyph similarity between each prediction result in the candidate word list and the first difference text is calculated in turn, obtaining the glyph similarity corresponding to each prediction result. The glyph similarity of each prediction result and the probability corresponding to that prediction result are then weighted and summed to obtain the score of each prediction result for the first difference text. Each score is compared with a second threshold: when exactly one prediction result has a score greater than or equal to the second threshold, that prediction result is taken as the corrected first difference text, i.e., as the correct recognition result. When a plurality of prediction results have scores greater than or equal to the second threshold, the corrected first difference text cannot be obtained; in other words, error correction of the first difference text fails.
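The scoring step can be sketched as below. The `glyph_similarity` stand-in (character overlap) and the weighting coefficient `alpha` are illustrative assumptions; the embodiment itself uses an IDS-based glyph similarity:

```python
def glyph_similarity(a, b):
    """Stand-in for the IDS-based glyph similarity: fraction of shared characters."""
    return len(set(a) & set(b)) / max(len(set(a)), len(set(b)))

def correct_text(diff_text, predictions, second_threshold, alpha=0.5):
    """Score each prediction result by a weighted sum of glyph similarity and
    model probability; the correction succeeds only when exactly one score
    clears the second threshold."""
    scored = [(word, alpha * glyph_similarity(diff_text, word) + (1 - alpha) * prob)
              for word, prob in predictions]
    above = [word for word, score in scored if score >= second_threshold]
    return above[0] if len(above) == 1 else None  # None: correction failed

predictions = [("cat", 0.8), ("dog", 0.1)]        # (prediction result, probability)
fixed = correct_text("cot", predictions, second_threshold=0.5)   # only "cat" clears it
failed = correct_text("cot", predictions, second_threshold=0.2)  # both clear it -> fail
```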
In some possible embodiments, in the process of correcting the first difference text, the top S prediction results of the candidate word list, ranked from high to low probability, may be selected (S is a positive integer set according to manual experience), and the glyph similarity between the first difference text and each of these S prediction results is then calculated. In this way the similarity between the first difference text and every prediction result in the candidate word list need not be calculated, which greatly improves the processing efficiency of the error correction process.
In some possible embodiments, when a plurality of prediction results have scores for the first difference text greater than or equal to the second threshold, the scores may be sorted in descending order and the top N prediction results selected as the N candidate words corresponding to the first difference text, for use in the subsequent difference item discrimination model. It will be appreciated that if S prediction results were selected from the candidate word list for score calculation, the N candidate words are determined from those S prediction results, and N is less than S.
In some possible embodiments, when the first probability is smaller than the first threshold or the second probability is smaller than the first threshold, the scores of the prediction results corresponding to the first difference text (or to the second difference text) need not be compared with the second threshold at all. Instead, the N candidate words corresponding to the first difference text and the N candidate words corresponding to the second difference text may be determined directly from the candidate word list by the method above (i.e., calculating the score of each prediction result from the glyph similarity and the probability, and sorting the scores), for use in the difference item discrimination model.
It should be noted that the glyph similarity may be measured by an edit distance based on Ideographic Description Sequences (IDS); the smaller the edit distance, the higher the glyph similarity between the first difference text and a given prediction result. An IDS describes the relative positions of the internal components of a character using twelve defined combining characters, thereby obtaining a description sequence of the character. The edit distance refers to the minimum number of editing operations required to convert one string into another, where an editing operation may be: replacing one character with another, inserting a character, or deleting a character.
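The edit distance described here is the standard Levenshtein distance over replace/insert/delete operations. A minimal sketch, operating on plain strings rather than actual IDS sequences:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of replace/insert/delete operations turning a into b."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))           # distance from a[:0] to each prefix of b
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a[i-1]
                        dp[j - 1] + 1,                  # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # replace (or keep)
            prev = cur
    return dp[n]

d = edit_distance("kitten", "sitting")  # -> 3
```

In the embodiment the inputs would be the IDS strings of the two characters being compared, so structurally similar characters yield a small distance.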
In one implementation, if the second difference text in the target difference item is recognized correctly (i.e., the second probability is greater than or equal to the first threshold) but it cannot be determined whether the first difference text is recognized correctly (i.e., the first probability is smaller than the first threshold), and the corrected first difference text is obtained after correcting the first difference text, then whether the corrected target difference item is a real difference item is determined as follows: the second difference text is compared with the corrected first difference text; if the two are the same, the corrected target difference item is determined not to be a real difference item; if the two are different, the corrected target difference item is determined to be a real difference item.
In another implementation, if it cannot be determined whether either the first difference text or the second difference text in the target difference item is recognized correctly (i.e., the first probability and the second probability are both smaller than the first threshold), error correction is performed on both the first difference text and the second difference text according to the method above, and then whether the corrected target difference item is a real difference item is determined. It can be understood that only after both the corrected first difference text and the corrected second difference text are obtained can it be determined whether the corrected target difference item is a real difference item: specifically, when the corrected first difference text differs from the corrected second difference text, the corrected target difference item is a real difference item; when they are the same, the corrected target difference item is not a real difference item. When at least one of the corrected first difference text and the corrected second difference text is not available, the difference item discrimination model must be used to determine whether the target difference item is a real difference item.
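The two implementations above reduce to a simple decision rule, sketched here; `None` stands for "hand off to the difference item discrimination model":

```python
def judge_after_correction(corrected_first, corrected_second):
    """True: real difference item; False: not a real difference item;
    None: at least one correction failed, so the discrimination model is needed."""
    if corrected_first is None or corrected_second is None:
        return None
    return corrected_first != corrected_second

r1 = judge_after_correction("market", "marked")  # differ -> real difference item
r2 = judge_after_correction("market", "market")  # same -> not a real difference item
r3 = judge_after_correction(None, "market")      # correction failed -> model needed
```

In the one-sided case of the first implementation, the already-correct second difference text is simply passed in as `corrected_second`.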
It can be seen that, by implementing the embodiment of the present application, the probability corresponding to each item in the target difference item is obtained through the language prediction model, discrimination of the target difference item is realized based on those probabilities, and real difference items can be effectively determined. In addition, the similarity between each item in the target difference item and each prediction result of the candidate word list output by the language prediction model is further combined to judge target difference items that cannot otherwise be determined, so that real difference items are identified and the accuracy of the document comparison result is improved.
In the process of judging the target difference item based on the language prediction model and the error correction process, there remain some situations in which it cannot be determined whether the target difference item is a real difference item. For example, the first probability corresponding to the first difference text is smaller than the first threshold, or the second probability corresponding to the second difference text is smaller than the first threshold. As another example, in the process of correcting the first difference text and/or the second difference text, a plurality of prediction results corresponding to the first difference text (or to the second difference text) have scores greater than the second threshold. In these cases, a difference item discrimination model must be used to discriminate the target difference item; this model is a deep learning model that fuses text, position, and image information.
The difference item discrimination model judges the target difference item based on the input data corresponding to the first difference text and the input data corresponding to the second difference text, so as to output a discrimination result for the target difference item. Referring to fig. 4, fig. 4 is a block diagram of the difference item discrimination model provided by an embodiment of the present application. The model includes a first feature extraction network, a second feature extraction network, a linear processing unit, and a classifier; the outputs of the first and second feature extraction networks are each connected to the input of the linear processing unit, and the output of the linear processing unit is connected to the input of the classifier.
Specifically, the first feature extraction network extracts a first feature vector based on the input data corresponding to the first difference text; the first feature vector represents a high-level semantic feature fusing the text, position codes, position information, and image information of the first single sentence. The second feature extraction network extracts a second feature vector based on the input data corresponding to the second difference text; the second feature vector represents a high-level semantic feature fusing the text, position codes, position information, and image information of the second single sentence. The linear processing unit performs linear processing on the first feature vector and the second feature vector to obtain a third feature vector, which represents the difference between the high-level semantic features of the first single sentence and those of the second single sentence. The classifier performs binary classification on the third feature vector to judge whether the target difference item formed by the first difference text and the second difference text is a real difference item. The first feature extraction network and the second feature extraction network are identical. The input data corresponding to the first difference text and to the second difference text are described below and are not detailed here.
Since the first feature extraction network corresponding to the first differential text and the second feature extraction network corresponding to the second differential text are identical, the first feature extraction network corresponding to the first differential text may be taken as an example, so as to describe the internal structure of the first feature extraction network and the input and output of each part.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a first feature extraction network provided by an embodiment of the present application, where the first feature extraction network includes a first text encoding end, a first image encoding end, and a first deep learning model, and an output of the first text encoding end and an output of the first image encoding end are respectively connected with an input of the first deep learning model. As can be seen from fig. 5, the first feature extraction network outputs a first feature vector based on the corresponding input data, and the input data corresponding to the first differential text includes first information corresponding to the first sentence and a first image of the first sentence. Specifically, the first text encoding end extracts a first fusion feature based on the first information, the first fusion feature represents a text fused with the first single sentence, a position code of a word in the first single sentence and a middle-level semantic feature of the position information, and the first image encoding end extracts a first image feature based on the first image, wherein the first image feature represents an image feature of each word in the first single sentence. The first deep learning model is used for fusing the first fused feature and the first image feature to output a first feature vector.
The first information comprises a word vector of each word in the first single sentence, a position code of each word in the first single sentence and position information of each word in the first single sentence, wherein the word vector of the word represents text information of the word, the position code of the word represents relative positions of the word in the first single sentence, and the position information of the word represents pixel coordinates, word height and word width of the word in an original image. The first feature extraction network may output a first feature vector based on the first information and the first image.
In some possible embodiments, the candidate word corresponding to the first difference text may also be introduced into the first information, where the first information specifically includes: the first sentence is fused with the word vectors corresponding to the N candidate words, the first sentence is fused with the position codes corresponding to the N candidate words, and the first sentence is fused with the position information corresponding to the N candidate words. In the first information, N candidate words correspond to the first difference text. The word vectors corresponding to the N candidate words after the first single sentence is fused comprise word vectors of each word in the first single sentence and word vectors of the N candidate words corresponding to the first difference text, the position codes corresponding to the N candidate words after the first single sentence is fused comprise the position codes of each word in the first single sentence and the position codes of the N candidate words corresponding to the first difference text, and the position information corresponding to the N candidate words after the first single sentence is fused comprises the position information of each word in the first single sentence and the position information of the N candidate words corresponding to the first difference text. It should be noted that, the determination manner of the N candidate words corresponding to the first difference text may refer to the above description, and will not be repeated herein.
When the first information includes related information (i.e., word vector, position code, and position information) of N candidate words corresponding to the first differential text, the first fusion feature represents a mid-level semantic feature in which the text of the first sentence, the relative position (position code) and position information of the word in the first sentence, the text of the candidate word corresponding to the first differential text, the relative position of the candidate word in the first sentence, and the position information of the candidate word are fused. Accordingly, when the related information of the candidate word corresponding to the first differential text is introduced into the first information, the first feature vector represents the advanced semantic features of the text, the position code, the position information and the image information fused with the first single sentence and the text, the position code and the position information of the candidate word corresponding to the first differential text. 
Correspondingly, if the N candidate words corresponding to the first difference text are introduced into the first information, the N candidate words corresponding to the second difference text are also introduced into the second information. In this case the second fusion feature represents a mid-level semantic feature fusing the text of the second single sentence, the relative position (position code) and position information of each word in the second single sentence, the text of the candidate words corresponding to the second difference text, and the relative positions and position information of those candidate words; the second feature vector represents a high-level semantic feature fusing the text, position codes, position information, and image information of the second single sentence with the text, position codes, and position information of the candidate words corresponding to the second difference text. The third feature vector then represents the difference between the high-level semantic features of the first single sentence after the N candidate words corresponding to the first difference text are introduced and the high-level semantic features of the second single sentence after the N candidate words corresponding to the second difference text are introduced.
It can be seen that the possible real content features corresponding to the position of the first difference text, namely a plurality of candidate words corresponding to the first difference text, are introduced into the first information, wherein the number of the candidate words is N, and the N is preset according to manual experience. Thus, the input of the first text encoding side is improved. In addition, in order to make the plurality of candidate words corresponding to the first differential text influence only the first differential text, but not the other words in the first single sentence where the first differential text is located, a matrix M is set for encoding at the subsequent first text encoding end, so that the plurality of candidate words corresponding to the introduced first differential text influence only the first differential text.
It should be noted that, when the first differential text includes only one word, the position code of each candidate word in the N candidate words corresponding to the first differential text is the same as the position code of the first differential text, and the position information of each candidate word in the N candidate words corresponding to the first differential text is the same as the position information of the first differential text. When the first difference text comprises a plurality of words, the position codes of the N candidate words corresponding to the first difference text are in one-to-one correspondence with the position codes of the first difference text, and the position information of the N candidate words corresponding to the first difference text is in one-to-one correspondence with the position information of the first difference text.
Referring to fig. 6, it is assumed that the first differential text is "body", the first sentence in which the first differential text is located is "body protocol is specifically below", the first sentence is extracted from the recognition result of the first document based on the location information of the first differential text, and it is assumed that N is 3, that is, three candidate words corresponding to the "body" of the first differential text are "book", "the" and "the" respectively, and these three candidate words are possible real contents in the location where the first differential text is located. The first image is an image corresponding to the first sentence "the body protocol is specifically as follows".
The three inputs of the first text encoding end are described in connection with fig. 6. The word vectors corresponding to the N candidate words fused into the first single sentence: as shown in fig. 6, the input sequence is the first single sentence with the 3 candidate words inserted, preceded by the CLS token, and the word vector corresponding to each word (e.g., E_CLS for the CLS token) is obtained according to the dictionary. The position codes corresponding to the N candidate words fused into the first single sentence represent the positional relation between the words in the sentence: as shown in the figure, the position numbers of the input sequence are {0, 1, 2, 3, …, 10}, and the position codes corresponding to each word are {E_0, E_1, E_2, E_3, …, E_10}; it should be noted that the position codes of the candidate words are identical to the position code of the first difference text, i.e., E_1 = E_2 = E_3 = E_4. The position information corresponding to the N candidate words fused into the first single sentence includes the coordinate information (abscissa x and ordinate y), height h, and width w of each word in the sentence, all extracted from the OCR recognition result. It can be understood that the pixel value of each word in the OCR recognition result may be set to "0" and the pixel value of the background to "1", so that each word is displayed as white and the background as black; the height h of a word is then the height of its minimum circumscribed rectangle, the width w of the word is the width of that rectangle, the abscissa x of the word is the column coordinate of the center of that rectangle, and the ordinate y of the word is the row coordinate of the center of that rectangle. The position information of each candidate word is the same as the position information of the first difference text.
In some possible embodiments, when the first difference text includes a plurality of words, the position codes and position information of each candidate word correspond one-to-one with those of the first difference text. For example, if the first difference text consists of two words and one of its corresponding candidate words also consists of two words, then the position code and position information of the first word of the candidate word are the same as those of the first word of the first difference text, and the position code and position information of the second word of the candidate word are the same as those of the second word of the first difference text.
After the three inputs of the first text encoding end are obtained, a matrix M is set so that the introduced candidate words affect only the first difference text in the first single sentence and do not affect the other words of the first single sentence. As shown on the right side of fig. 6, the size of the matrix M is 11 × 11. A dark box in M indicates that the corresponding words are visible to each other, and the value in a dark box may be set to "0"; a light box in M indicates that the corresponding words are invisible to each other, and the value in a light box may be set to "-∞". The position numbers of the three candidate words are 2, 3, and 4, and the position number of the first difference text "body" is 1. It can be seen that M(1, 2) = M(2, 1) = 0, i.e., "body" and "book" are visible to each other, while M(5, 2) = M(2, 5) = -∞, i.e., "book" and "co" are invisible to each other. Therefore, the three candidate words are set to be visible to the first difference text and invisible to the other words in the first single sentence, so that the candidate words influence only the first difference text and do not affect the context in which the first difference text is located.
In some possible embodiments, when the first difference text comprises a plurality of words, each candidate word also comprises a plurality of words; in this case, each word of a candidate word affects only the corresponding word of the first difference text. For example, if the first difference text consists of two words and a corresponding candidate word also consists of two words, the first word of the candidate word is set to be visible only to the first word of the first difference text, and the second word of the candidate word is set to be visible only to the second word of the first difference text.
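The visibility matrix M of fig. 6 can be sketched as below. Whether candidate words are visible to one another is not specified in the text, so this sketch assumes they are mutually invisible:

```python
import numpy as np

def build_visibility_matrix(n_tokens, diff_positions, candidate_positions):
    """0 = visible, -inf = invisible. Ordinary words all see each other;
    a candidate word sees only itself and the difference-text positions."""
    NEG = float("-inf")
    M = np.zeros((n_tokens, n_tokens))
    for c in candidate_positions:
        for i in range(n_tokens):
            if i != c and i not in diff_positions:
                M[c, i] = M[i, c] = NEG   # candidate invisible to ordinary words
    return M

# fig. 6 layout: 11 tokens, difference text at position 1, candidates at 2-4
M = build_visibility_matrix(11, diff_positions=[1], candidate_positions=[2, 3, 4])
```

For the multi-word case, `diff_positions` would be restricted per candidate word so that each candidate word sees only its corresponding word of the difference text.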
Accordingly, the encoded representation of the first text encoding end also needs to be modified accordingly, and the self-attention of the first text encoding end is given by formula (2):

Attention(Q, K, V, M) = softmax((QK^T + M) / √d_k) · V    (2)
Wherein QK^T + M represents the similarity (i.e., the weight) between each Q vector (the feature of a target word) and each K vector (the feature of the context information), and the influence factor of the candidate words is introduced into this similarity calculation: when the Q vector corresponding to a target word and the K vector corresponding to a certain word of the context information are visible to each other, M is 0 and QK^T + M = QK^T; when they are invisible to each other, M is −∞ and QK^T + M is −∞. Correspondingly, the softmax term

softmax((QK^T + M) / √d_k)    (3)

represents the distribution (i.e., the normalized weight) of the similarity between each target word and each piece of context information: when the Q vector corresponding to a certain target word is invisible to the K vector of a certain word, M is −∞ and the corresponding softmax entry is 0, so there is no correlation between that target word and that word, and the introduced candidate words therefore affect only the first difference text. Attention(Q, K, V, M) represents the weighted sum of the value vectors, which yields the enhanced semantic vector of each target word; the enhanced vectors of all the target words constitute the first fusion feature.
For formula (2) and formula (3), when the Q vector corresponding to a certain target word is invisible to the K vector of a certain word, M(i, j) is set to −∞ and the corresponding softmax entry is 0; in other words, the correlation between the target word and the word is 0. In some possible embodiments, when the Q vector corresponding to a target word is invisible to the K vector of a word, M(i, j) may also be set to another value such that the corresponding softmax entry equals a preset value; the preset value may be 0.1, 0.2, 0.01, 0.001 or another value, so that the correlation between the target word and the word equals the preset value, i.e., is as small as possible or negligible.
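Formulas (2) and (3) can be exercised numerically. The sketch below assumes single-head attention with no learned projections, which is a simplification of the actual encoder:

```python
import numpy as np

def masked_self_attention(Q, K, V, M):
    """Formulas (2)/(3): Attention(Q, K, V, M) = softmax((QK^T + M)/sqrt(d_k)) V.

    An entry M[i, j] = -inf drives the softmax weight of pair (i, j) to exactly 0,
    so an "invisible" word contributes nothing to the weighted sum of V.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# 3 tokens with 4-dim features; tokens 1 and 2 are mutually invisible
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
M = np.zeros((3, 3))
M[1, 2] = M[2, 1] = -np.inf
out, w = masked_self_attention(Q, K, V, M)
```

The weight matrix `w` has exact zeros precisely at the masked positions, while every row still sums to 1.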
It should be noted that the first text encoding end may be the BERT in fig. 3A, and the structure of the first text encoding end may refer to the BERT structure in fig. 3A.
In summary, the first text encoding end obtains the first fusion feature based on the three inputs (the word vector corresponding to the first single sentence fusion candidate word, the position code corresponding to the first single sentence fusion candidate word, and the position information corresponding to the first single sentence fusion candidate word) obtained above.
The first image encoding end in the first feature extraction network is used for extracting the image feature information of the first single sentence where the first difference text is located. The first image encoding end may employ a convolutional neural network (Convolutional Neural Network, CNN). A CNN includes an input layer, convolutional layers/pooling layers (the pooling layers are optional), fully-connected layers and an output layer. The input layer receives the input data, which is the image of the first single sentence where the first difference text is located, and the output data of the output layer is the first image feature.
The convolutional layers may include a plurality of convolution operators, which act as filters for extracting specific information from the input data; the pooling layers serve to reduce the number of training parameters and increase the training speed of the network, so a pooling layer is often added periodically after a convolutional layer. Convolutional layers and pooling layers may alternate, i.e., each convolutional layer may be followed by one pooling layer, or several convolutional layers may be followed by one pooling layer. After the convolutional layers/pooling layers, one or more fully-connected layers follow; each neuron in a fully-connected layer is fully connected with all neurons in the previous layer, and the fully-connected layers integrate the class-discriminative local information from the convolutional or pooling layers, with the output value of the last fully-connected layer transmitted to the output layer. In the present application, the output layer directly outputs the first image feature received from the fully-connected layer without any processing. In some possible embodiments, the first image encoding end may also be the network in an OCR recognition network that acquires the features of the image corresponding to the text.
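The two building blocks just described can be illustrated with a toy single-channel example (a didactic sketch, not the image encoder actually used):

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2-D convolution (cross-correlation, as convolutional layers compute it)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: shrinks the map, cutting downstream parameters."""
    h, w = x.shape[0] - x.shape[0] % size, x.shape[1] - x.shape[1] % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# a 4x4 "sentence image" filtered by a 2x2 summing kernel, then pooled
img = np.arange(16, dtype=float).reshape(4, 4)
out = conv2d(img, np.ones((2, 2)))
pooled = max_pool(img)
```

The pooled map is a quarter of the input size, which is exactly why pooling reduces the parameter count of the following layers.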
Referring to fig. 5, after the first fusion feature output by the first text encoding end and the first image feature output by the first image encoding end are obtained, the first fusion feature and the first image feature are input into the first deep learning model of the first feature extraction network for feature extraction to obtain a first feature vector. It can be seen that the first feature vector is obtained by the first deep learning model by fusing the text, position and image information, so the first feature vector can better represent the overall feature of the first single sentence fused with the candidate words corresponding to the first difference text. In some possible embodiments, after the first fusion feature and the first image feature are obtained, the first fusion feature and the first image feature are spliced, and the spliced vector is input into the first deep learning model to extract the first feature vector.
The first deep learning model may be an LSTM model; LSTM is a recurrent neural network that can better model sequence data with spatio-temporal correlation. The relevant description of the LSTM model may refer to the above description and is not repeated here.
It can be seen that, after the N candidate words corresponding to the first difference text are obtained in the error correction process, the first feature vector may be obtained through the first feature extraction network. To judge whether the target difference item formed by the first difference text and the second difference text is a real difference item, a second feature vector also needs to be obtained through the second feature extraction network. It should be noted that, if N candidate words are determined for the first difference text in the error correction process, then regardless of whether the second difference text itself needs error correction, the same method needs to be used to determine N candidate words corresponding to the second difference text from the candidate word list: the score of each prediction result is calculated based on the font similarity between the prediction result in the candidate word list and the second difference text and the probability corresponding to the prediction result, and the prediction results corresponding to the first N higher scores among the obtained scores are taken as the N candidate words corresponding to the second difference text. The second feature extraction network may then extract the second feature vector based on the word vectors corresponding to the second single sentence fused with the N candidate words, the position codes corresponding to the second single sentence fused with the N candidate words, the position information corresponding to the second single sentence fused with the N candidate words, and the second image of the second single sentence.
In one embodiment, if N candidate words corresponding to the first difference text are determined but the second difference text is recognized correctly (i.e., the second probability corresponding to the second difference text is greater than the first threshold), the second difference text has no corresponding candidate words. In this case, a zero-padding operation may be performed on the second single sentence where the second difference text is located, so that the length of the zero-padded second single sentence is the same as the length of the first single sentence after the N candidate words are introduced.
In one embodiment, if N candidate words corresponding to the first difference text are determined, and the second difference text obtains a corrected second difference text in the error correction process, that is, the second difference text corresponds to only one candidate word, then in order to align the input of the first text encoding end with the input of the second text encoding end, (N − 1) additional candidate words may be selected from the candidate word list according to the scoring result of the prediction results, so that the number of candidate words corresponding to the second difference text is N. In another implementation, a zero-padding operation may be performed on the second single sentence after the one candidate word is introduced, so that the length of the zero-padded second single sentence is the same as the length of the first single sentence after the N candidate words are introduced.
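The zero-padding alternative amounts to right-padding the shorter token sequence; a minimal sketch (the pad id 0 is an assumption):

```python
def pad_to_length(token_ids, target_len, pad_id=0):
    """Right-pad a token sequence so both encoder inputs have equal length."""
    if len(token_ids) > target_len:
        raise ValueError("sequence longer than target length")
    return list(token_ids) + [pad_id] * (target_len - len(token_ids))

# second sentence after introducing 1 candidate (9 tokens), aligned to the
# first sentence after introducing N = 3 candidates (11 tokens)
second = pad_to_length(list(range(9)), 11)
```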
The structure of the second feature extraction network is the same as that of the first feature extraction network, namely the second feature extraction network comprises a second text encoding end, a second image encoding end and a second deep learning model, wherein the output of the second text encoding end and the output of the second image encoding end are respectively connected with the input of the second deep learning model. The input of the second text coding end is word vectors corresponding to the N candidate words fused by the second single sentence, position codes corresponding to the N candidate words fused by the second single sentence and position information corresponding to the N candidate words fused by the second single sentence, and the output of the second text coding end is a second fusion characteristic; the input of the second image coding end is an image of a second single sentence where the second difference text is located, and the output of the second image coding end is a second image feature. The second deep learning model is used for fusing the second fused feature and the second image feature to output a second feature vector. The implementation process of each part in the second feature extraction network may refer to the implementation process of each part in the first feature extraction network, which is not described herein for brevity.
Finally, after the first feature vector output by the first feature extraction network and the second feature vector output by the second feature extraction network are obtained, linear processing is performed on the first feature vector and the second feature vector to obtain a third feature vector. For example, assuming that the first feature vector is U and the second feature vector is V, the linear processing may be: adding the vector U + V and the vector |U − V| (the absolute value of the vector obtained by subtracting V from U) to obtain the third feature vector, namely U + V + |U − V| is taken as the third feature vector. It should be noted that this linear processing enables the third feature vector to fuse the relation between the first feature vector and the second feature vector. The third feature vector is input into the classifier of the difference item discrimination model; the classifier may be softmax, which performs binary classification on the third feature vector to determine whether the target difference item is a real difference item.
In one implementation, softmax performs binary classification on the third feature vector, outputting the probability that the target difference item is a real difference item and the probability that it is not. When the probability that the target difference item is a real difference item is greater than the probability that it is not, the target difference item is determined to be a real difference item; when the probability that the target difference item is a real difference item is smaller than the probability that it is not, the target difference item is determined not to be a real difference item.
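The linear processing and the two-class softmax head can be sketched as follows. U + V + |U − V| is read element-wise as the text states; the identity weights of the toy head are hypothetical:

```python
import numpy as np

def combine_features(u, v):
    """Third feature vector: U + V + |U - V| (element-wise)."""
    return u + v + np.abs(u - v)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(feature, W, b):
    """Two-class softmax head: index 1 = 'real difference item'."""
    probs = softmax(W @ feature + b)
    return bool(probs[1] > probs[0]), probs

# toy 2-d feature vectors and a hypothetical identity head
f = combine_features(np.array([1.0, -2.0]), np.array([3.0, 0.0]))
pred, probs = classify(f, np.eye(2), np.zeros(2))
```

The decision rule is exactly the comparison of the two class probabilities described above.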
It can be seen that, by implementing the embodiment of the present application, whether the target difference item is a real difference item is judged through a deep learning model that fuses the text, position and image information of the target difference item; the candidate word features output by the language prediction model for the target difference item are introduced, and the encoding of the text and position information of the target difference item is improved. Non-real difference items (caused by OCR recognition errors) can thus be effectively filtered out, the recognition accuracy of real difference items in document comparison is improved, and the accuracy of the document comparison result is improved.
The following describes the training process of the difference term discrimination model.
First, sample data is prepared: sample data of positive difference item samples and sample data of negative difference item samples are obtained from multiple groups of documents of the same type (e.g., contracts), wherein a positive difference item sample represents a real difference item, and a negative difference item sample represents a non-real difference item. The positive and negative difference item samples can be labeled manually: for the OCR recognition results of two documents to be compared, the recognition results are manually compared with the corresponding document originals to label the positive difference item samples and the negative difference item samples.
Wherein the sample data of a positive difference item sample includes: the text and position information of the single sentence corresponding to the positive difference item sample and the image of that single sentence, wherein the position information of the single sentence includes the abscissa, ordinate, width and height of each word in the single sentence; the data corresponding to the positive difference item is obtained from the OCR recognition result of the document corresponding to the positive difference item. Similarly, the sample data of a negative difference item sample includes: the text of the single sentence corresponding to the negative difference item sample, the position information, and the image of that single sentence. It will be appreciated that both positive and negative difference item samples occur in pairs.
The sample data of the positive difference item samples and the sample data of the negative difference item samples are input in turn into the difference item discrimination model to train it. Taking the sample data of a positive difference item sample as an example: the sample data of the first difference item in a group of positive difference item samples is input into the first feature extraction network of the difference item discrimination model to extract a first feature vector, and the sample data of the second difference item in the group is input into the second feature extraction network to extract a second feature vector; linear processing is performed on the first feature vector and the second feature vector to obtain a third feature vector; the classifier of the difference item discrimination model performs binary classification on the third feature vector and outputs the classification result; the loss is calculated according to the label corresponding to the positive difference item sample, and the difference item discrimination model is trained according to the loss, so that the classification result of the model on the input data is consistent with the label corresponding to the input data.
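The loss computation in the training step above can be illustrated with cross-entropy; the patent does not name the loss function, so cross-entropy is an assumption (it is the standard choice for a softmax classifier):

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the labeled class (label 1 = real difference item)."""
    return -np.log(probs[label])

# a positive sample classified at 0.9 for class 1 is penalized less than one at 0.6,
# pushing the model's classification result toward the label
loss_good = cross_entropy(np.array([0.1, 0.9]), 1)
loss_bad = cross_entropy(np.array([0.4, 0.6]), 1)
```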
Referring to fig. 7, a difference term discriminating method provided in the embodiment of the present application is described below, and includes, but is not limited to, the following steps:
S101, acquiring a target difference item, wherein the target difference item comprises a first difference text and a second difference text.
In one embodiment, the target difference term is obtained according to the recognition result of the first sentence and the recognition result of the second sentence. In the target difference item, the number of words included in the first difference text is the same as the number of words included in the second difference text, in other words, the length of the first difference text is equal to the length of the second difference text. The first sentence and the second sentence are from the associated two documents, respectively.
S102, determining a first probability corresponding to the first difference text and a second probability corresponding to the second difference text based on the language prediction model.
In one implementation, a target sentence corresponding to a target difference item is extracted, wherein the target sentence comprises a public item and a shielding item, the public item is the same text in a first single sentence where a first difference text is located and a second single sentence where a second difference text is located, and the shielding item is an item to be predicted. And then, inputting the target sentence into a language prediction model to predict the shielding item, obtaining a candidate word list corresponding to the shielding item, wherein the candidate word list comprises a plurality of prediction results and probabilities corresponding to the plurality of prediction results, searching a first difference text in the candidate word list and obtaining a first probability corresponding to the first difference text, and similarly, searching a second difference text in the candidate word list and obtaining a second probability corresponding to the second difference text. The language prediction model may be a BERT-based language prediction model. The relevant specific process may refer to the above description, and will not be repeated here.
S103, judging whether the first probability and the second probability are both larger than or equal to a first threshold value.
In one implementation, when the first probability and the second probability are both greater than or equal to the first threshold, S104 is executed; when the first probability and the second probability are not both greater than or equal to the first threshold (i.e., at least one of them is smaller than the first threshold), S105 is executed. The first threshold is preset according to manual experience.
S104, determining the target difference item as a real difference item.
In one implementation, when the first probability and the second probability are both greater than or equal to the first threshold, it is indicated that the first difference text and the second difference text are both recognized correctly, so that the target difference item is determined to be a true difference item.
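Steps S102 to S104 amount to a lookup in the candidate word list plus a threshold test; a minimal sketch (representing the candidate word list as a word-to-probability dictionary is an assumption):

```python
def judge_by_probability(candidate_probs, first_text, second_text, threshold):
    """Both probabilities >= threshold -> both texts recognized correctly,
    so the target difference item is a real difference item (S104).
    A text absent from the candidate list gets probability 0."""
    p1 = candidate_probs.get(first_text, 0.0)
    p2 = candidate_probs.get(second_text, 0.0)
    return "real" if p1 >= threshold and p2 >= threshold else "undetermined"
```

An "undetermined" result falls through to S105/S106 below.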
S105, when there exists a target probability smaller than the first threshold, determining N candidate words corresponding to the first difference text and N candidate words corresponding to the second difference text from the candidate word list according to a first preset condition.
In one implementation, the target probability is the first probability or the second probability; that is, when the first probability is smaller than the first threshold or the second probability is smaller than the first threshold, the N candidate words corresponding to the first difference text and the N candidate words corresponding to the second difference text are determined according to the first preset condition.
Specifically, when any one of the first probability and the second probability is smaller than a first threshold, calculating the score of each prediction result according to the font similarity between each prediction result and the first difference text in the candidate word list and the probability corresponding to the prediction result, and sorting the scores of the plurality of prediction results obtained based on the first difference text in a descending order, wherein the first preset condition can be that the prediction result corresponding to the first N higher scores is taken, so that N candidate words corresponding to the first difference text are obtained. And for the second difference text, calculating the score of each prediction result according to the font similarity between each prediction result and the second difference text in the candidate word list and the probability corresponding to the prediction result, and sorting the scores of a plurality of prediction results obtained based on the second difference text in a descending order, wherein the first preset condition can be that the prediction results corresponding to the first N higher scores are taken, so that N candidate words corresponding to the second difference text are obtained.
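The first preset condition can be sketched as a scoring-and-ranking routine. How the font similarity and the probability are combined into one score is not specified by the text, so the product used below is one plausible combination, not the patent's formula:

```python
def top_n_candidates(predictions, glyph_sim, n):
    """Score each prediction by combining its language-model probability with its
    glyph similarity to the difference text, sort in descending order, keep top n."""
    scored = {w: p * glyph_sim.get(w, 0.0) for w, p in predictions.items()}
    return sorted(scored, key=scored.get, reverse=True)[:n]
```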
S106, acquiring first data corresponding to the first difference text and second data corresponding to the second difference text, and judging whether the target difference item is a real difference item or not based on the first data and the second data by using a difference item judging model.
Specifically, first data corresponding to the first difference text and second data corresponding to the second difference text are obtained, and whether the target difference item is a real difference item is judged based on the first data and the second data by using a difference item judging model. The difference term discrimination model may refer to the above related description, and will not be described herein.
The first data comprises first information corresponding to a first single sentence where the first difference text is located and a first image of the first single sentence, and the second data comprises second information corresponding to a second single sentence where the second difference text is located and a second image of the second single sentence. The first information comprises word vectors of N candidate words corresponding to the first difference text fused by the first single sentence, position codes of N candidate words corresponding to the first difference text fused by the first single sentence and position information of N candidate words corresponding to the first difference text fused by the first single sentence, and the second information comprises word vectors of N candidate words corresponding to the second difference text fused by the second single sentence, position codes of N candidate words corresponding to the second difference text fused by the second single sentence and position information of N candidate words corresponding to the second difference text fused by the second single sentence.
For the first information, the word vector after the first sentence merges the N candidate words corresponding to the first differential text includes the word vector of each word in the first sentence and the word vector of the N candidate words corresponding to the first differential text, the position code corresponding to the first sentence after the first sentence merges the N candidate words corresponding to the first differential text includes the position code of each word in the first sentence and the position code of the N candidate words corresponding to the first differential text, and the position information corresponding to the first sentence after the first sentence merges the N candidate words corresponding to the first differential text includes the position information of each word in the first sentence and the position information of the N candidate words corresponding to the first differential text. The description of each part in the second information may refer to the related description in the first information, which is not repeated herein.
The specific process of determining whether the target difference term is the true difference term based on the first data and the second data in the step S106 by using the difference term determining model may refer to the above description, and will not be described herein.
In some possible embodiments, when the target probability is smaller than the first threshold, S105 may be skipped and S106 executed directly. In this case, since the N candidate words corresponding to the first difference text and the N candidate words corresponding to the second difference text are not determined, the first information includes only the word vector of each word in the first single sentence, the position code of each word in the first single sentence and the position information of each word in the first single sentence, and the second information includes only the word vector of each word in the second single sentence, the position code of each word in the second single sentence and the position information of each word in the second single sentence. The difference item discrimination model then judges the target difference item based on the first data and the second data.
It can be seen that, by implementing the embodiment of the present application, the probability corresponding to each item in the target difference item is obtained through the language prediction model, the discrimination of the target difference item is realized based on these probabilities, and real difference items can be effectively determined. For target difference items that cannot be determined by the language prediction model, whether the target difference item is a real difference item is judged through a deep learning model fusing the context information and image of the target difference item, the candidate word features corresponding to the target difference item, and other information; non-real difference items (caused by OCR recognition errors) can be effectively filtered out, and the accuracy of real difference items in the document comparison result is improved, thereby improving the accuracy of the document comparison result.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating another difference term determining method according to the present application, and it should be noted that the embodiment of fig. 8 may be independent of the embodiment of fig. 7 or may be complementary to the embodiment of fig. 7. The method includes, but is not limited to, the steps of:
S201, acquiring a target difference item, wherein the target difference item comprises a first difference text and a second difference text. The step may be specifically described with reference to S101 in the embodiment of fig. 7, which is not described herein.
S202, determining a first probability corresponding to the first difference text and a second probability corresponding to the second difference text based on the language prediction model. The step may be specifically described with reference to S102 in the embodiment of fig. 7, which is not described herein.
S203, judging whether the first probability and the second probability are both larger than or equal to a first threshold value. The step may be specifically described with reference to S103 in the embodiment of fig. 7, which is not described herein.
S204, determining the target difference item as a real difference item. The step may be specifically described with reference to S104 in the embodiment of fig. 7, which is not described herein.
S205, when there exists a target probability smaller than the first threshold, determining candidate words corresponding to the target text from the candidate word list according to a second preset condition.
Specifically, determining a candidate word corresponding to a target text from a candidate word list according to a second preset condition, wherein the target text is a text corresponding to a target probability, for example, when the target probability is a first probability, the target text is a first difference text; when the target probability is the second probability, the target text is the second difference text. The second preset condition is that the score of the candidate word in the candidate word list is larger than or equal to a second threshold value, and the score of the candidate word is determined by the probability corresponding to the candidate word and the font similarity between the candidate word and the target text.
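The second preset condition can be sketched as a threshold filter over the same scores. As before, combining probability and glyph similarity by a product is an assumption:

```python
def candidates_above_threshold(predictions, glyph_sim, second_threshold):
    """Second preset condition (S205): keep only predictions whose score
    (probability x glyph similarity, as assumed here) reaches the second threshold."""
    return [w for w, p in predictions.items()
            if p * glyph_sim.get(w, 0.0) >= second_threshold]
```

When exactly one candidate survives the filter, the flow proceeds to the correction branch (S207); otherwise it proceeds to S208.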
In one embodiment, when only the first probability is smaller than the first threshold, a candidate word corresponding to the first difference text is determined from the candidate word list according to the second preset condition. In one embodiment, when only the second probability is smaller than the first threshold, a candidate word corresponding to the second difference text is determined from the candidate word list according to a second preset condition. In one implementation, if the first probability is smaller than the first threshold and the second probability is smaller than the first threshold, determining candidate words corresponding to the first difference text and candidate words corresponding to the second difference text from the candidate word list respectively.
S206, judging whether the number of the candidate words is 1.
In one implementation, when the number of candidate words is 1, S207 is performed; when the number of candidate words is not 1, S208 is performed.
S207, correcting the target text in the target difference item, and judging whether the corrected target difference item is a real difference item.
In one embodiment, when only the first probability is smaller than the first threshold and the number of candidate words corresponding to the first difference text is determined to be 1, the candidate word is taken as the corrected first difference text. The second difference text is then compared with the corrected first difference text: when the second difference text is the same as the corrected first difference text, it is determined that the corrected target difference item is not a real difference item; when the second difference text is different from the corrected first difference text, it is determined that the corrected target difference item is a real difference item.
In one implementation, when only the second probability is smaller than the first threshold and the number of candidate words corresponding to the second difference text is determined to be 1, the candidate word is taken as the corrected second difference text. The first difference text is then compared with the corrected second difference text: when the first difference text is the same as the corrected second difference text, it is determined that the corrected target difference item is not a real difference item; when the first difference text is different from the corrected second difference text, it is determined that the corrected target difference item is a real difference item.
In one implementation, if the first probability is smaller than the first threshold and the second probability is smaller than the first threshold, and the number of candidate words corresponding to the first difference text is determined to be 1 and the number of candidate words corresponding to the second difference text is determined to be 1, then the candidate word corresponding to the first difference text is taken as the corrected first difference text and the candidate word corresponding to the second difference text is taken as the corrected second difference text. The corrected first difference text and the corrected second difference text are then compared: when they are the same, it is determined that the corrected target difference item is not a real difference item; when they are different, it is determined that the corrected target difference item is a real difference item.
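The three correction cases above share one decision rule, which can be sketched compactly (the function and parameter names are illustrative):

```python
def judge_after_correction(first_text, second_text,
                           first_candidate=None, second_candidate=None):
    """S207: a text whose probability failed the threshold and that has exactly one
    candidate word is replaced by that candidate; the (possibly corrected) pair is
    then compared. Equal -> not a real difference item (the difference came from an
    OCR error); different -> a real difference item."""
    corrected_first = first_candidate if first_candidate is not None else first_text
    corrected_second = second_candidate if second_candidate is not None else second_text
    return corrected_first != corrected_second  # True -> real difference item
```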
It should be noted that the process of correcting the target text may refer to the above description of the error correction process; for brevity, details are not repeated here.
S208, determining N candidate words corresponding to the first difference text and N candidate words corresponding to the second difference text from the candidate word list according to a first preset condition. For details of this step, reference may be made to S105 in the embodiment of fig. 7; it is not repeated here.
S209, acquiring first data corresponding to the first difference text and second data corresponding to the second difference text, and judging whether the target difference item is a real difference item based on the first data and the second data by using a difference item discrimination model. For details of this step, reference may be made to S106 in the embodiment of fig. 7; it is not repeated here.
It can be seen that, by implementing the embodiment of the application, the probability corresponding to each item in the target difference item is obtained through the language prediction model, discrimination of the target difference item is performed based on those probabilities, and real difference items can be effectively determined. For target difference items whose authenticity cannot be determined in this way, the similarity between each item in the target difference item and each prediction result output by the language prediction model is further used for judgment, so that real difference items can still be effectively determined and the accuracy of the document comparison result is improved. Furthermore, whether the target difference item is a real difference item is judged by a deep learning model that fuses the text, position, and image information corresponding to the target difference item; candidate-word features output by the language prediction model for the target difference item are introduced, and the encoding of the text and position information of the target difference item is improved. Non-real difference items (caused by OCR recognition errors) can thereby be effectively filtered out, improving the recognition accuracy of real difference items in document comparison and hence the accuracy of the document comparison result.
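The probability-based stage of the flow above can be sketched as follows. The language prediction model is stood in for by a plain dictionary mapping a masked sentence to candidate-word probabilities; all names and the threshold are illustrative assumptions, not identifiers from the patent.

```python
# Sketch of the masked-prediction discrimination step: build the target
# sentence from the public item with a mask in place of the difference,
# look up both difference texts in the candidate word list, and compare
# their probabilities against a threshold.

def predict_mask(masked_sentence, model):
    """Return {candidate_word: probability} for the [MASK] slot.
    Here `model` is a dict standing in for the language prediction model."""
    return model.get(masked_sentence, {})

def discriminate(common_prefix, common_suffix, first_text, second_text,
                 model, threshold=0.5):
    masked = common_prefix + "[MASK]" + common_suffix
    candidates = predict_mask(masked, model)
    p1 = candidates.get(first_text, 0.0)
    p2 = candidates.get(second_text, 0.0)
    if p1 >= threshold and p2 >= threshold:
        # Both readings are plausible in context: a real difference item.
        return True
    # Otherwise the further correction/discrimination steps apply.
    return None

model = {"The contract takes effect on [MASK].": {"Monday": 0.6, "Friday": 0.7}}
print(discriminate("The contract takes effect on ", ".", "Monday", "Friday", model))
# prints True
```

When either probability falls below the threshold, the function returns `None` to signal that the correction or discrimination-model stages are still required.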
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computing device according to an embodiment of the present application, and the computing device 20 includes at least a processor 201, a memory 202, a communication interface 203, and a bus 200. In some possible embodiments, computing device 20 also includes an input/output interface 204. Wherein the memory 202, the communication interface 203, and the processor 201 are connected or coupled by a bus 200. The computing device 20 may be the execution apparatus 110 or the training apparatus 120 in the embodiment of fig. 1.
Specific implementation of the processor 201 in executing each operation may refer to the specific operations of obtaining the target difference term, determining the first probability and the second probability, and using the difference term discrimination model to discriminate the target difference term in the above method embodiment. The processor 201 may be comprised of one or more general purpose processors, such as a central processing unit (Central Processing Unit, CPU), or a combination of a CPU and hardware chips. The hardware chip may be an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a programmable logic device (Programmable Logic Device, PLD), or a combination thereof. The PLD may be a complex programmable logic device (Complex Programmable Logic Device, CPLD), a field-programmable gate array (Field-Programmable Gate Array, FPGA), generic array logic (Generic Array Logic, GAL), or any combination thereof.
Memory 202 may include volatile memory (Volatile Memory), such as random access memory (Random Access Memory, RAM); the memory 202 may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); the memory 202 may also include combinations of the above. The memory 202 may store programs and data, wherein the stored programs include: language prediction model, difference term discrimination model, etc., and the stored data includes: target difference term, location information, candidate vocabulary, etc. The memory 202 may be present alone or may be integrated within the processor 201.
The communication interface 203 may be a wired interface or a wireless interface. The wired interface may be an ethernet interface, a local interconnect network (Local Interconnect Network, LIN), etc., and the wireless interface may be a cellular network interface or a wireless local area network interface, etc. The communication interface 203 is used to enable communication between the computing device 20 and other devices (e.g., scanners, etc.).
In some possible embodiments, the computing apparatus 20 further includes an input/output interface 204, the input/output interface 204 being configured to interface with an input/output device for receiving input information and outputting processing results. The input/output device may be a mouse, keyboard, display, scanner or optical drive, etc.
Bus 200 is used to communicate information between various components of computing device 20, and bus 200 may be wired or wireless, as the application is not limited in this regard.
In an embodiment of the present application, the computing device 20 is configured to implement the method described in the embodiment of fig. 7 and 8.
Moreover, FIG. 9 is merely an example of a computing device 20, and the computing device 20 may include more or fewer components than shown in FIG. 9, or may have a different configuration of components. Meanwhile, various components shown in fig. 9 may be implemented in hardware, software, or a combination of hardware and software.
Referring to fig. 10, fig. 10 is a schematic functional structure diagram of a computing device according to an embodiment of the present application, and the computing device 30 includes an obtaining unit 310, a predicting unit 311, and a discriminating unit 312. In some possible embodiments, computing device 30 further includes a processing unit 313. The computing device 30 may be implemented in hardware, software, or a combination of hardware and software.
The obtaining unit 310 is configured to obtain a target difference item in a recognition result of a first sentence and a recognition result of a second sentence, where the target difference item includes a first difference text and a second difference text, the first sentence includes a public item and the first difference text, and the second sentence includes the public item and the second difference text; the prediction unit 311 is configured to determine a first probability corresponding to the first difference text and a second probability corresponding to the second difference text based on the language prediction model and the public item; the discriminating unit 312 is configured to determine whether the target difference item is a real difference item according to the first probability and the second probability; and the processing unit 313 is configured to obtain first data corresponding to the first difference text and second data corresponding to the second difference text, and determine, using the difference item discrimination model, whether the target difference item is a real difference item based on the first data and the second data.
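A hypothetical decomposition of these three units into code might look like the following. The class, its method names, and the character-level alignment are illustrative assumptions for exposition, not the patent's implementation.

```python
# Illustrative sketch of the obtaining / prediction / discrimination units.
class DifferenceDiscriminator:
    def __init__(self, language_model, threshold=0.5):
        self.language_model = language_model  # callable: sentence -> {word: prob}
        self.threshold = threshold

    def obtain(self, first_sentence, second_sentence):
        """Obtaining unit: align the two recognition results and split them
        into the public items and the differing positions."""
        common = [a for a, b in zip(first_sentence, second_sentence) if a == b]
        diff = [(a, b) for a, b in zip(first_sentence, second_sentence) if a != b]
        return common, diff

    def predict(self, masked_sentence, first_text, second_text):
        """Prediction unit: look up both difference texts in the candidate
        word list produced for the masked slot."""
        candidates = self.language_model(masked_sentence)
        return candidates.get(first_text, 0.0), candidates.get(second_text, 0.0)

    def discriminate(self, p1, p2):
        """Discrimination unit: both texts probable in context means the
        difference is real rather than an OCR artifact."""
        return p1 >= self.threshold and p2 >= self.threshold
```

A usage example: `DifferenceDiscriminator(lambda s: {"A": 0.8, "B": 0.9})` would judge the pair ("A", "B") a real difference, since both candidates clear the threshold.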
The functional modules of the computing device 30 may be used to implement the method described in the embodiment of fig. 7. In the embodiment of fig. 7, the obtaining unit 310 may be used to execute S101, the predicting unit 311 may be used to execute S102, the discriminating unit 312 may be used to execute S103-S105, and the processing unit 313 may be used to execute S106. The functional modules of the computing device 30 may also be used to implement the method described in the embodiment of fig. 8, and are not described herein for brevity.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It should be noted that all or part of the steps in the various methods of the foregoing embodiments may be implemented by a program, which may be stored in a computer-readable storage medium, including read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), one-time programmable read-only memory (One-Time Programmable Read-Only Memory, OTPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc memory, magnetic disc memory, tape memory, or any other medium capable of carrying or storing data.
The technical solution of the present application may be embodied, in essence or in part, in the form of a software product stored in a storage medium, comprising several instructions for causing a device (which may be a personal computer, a server, a network device, a robot, a single-chip microcomputer, a chip, etc.) to perform all or part of the steps of the method according to the embodiments of the present application.

Claims (15)

1. A method for discriminating a difference term, the method comprising:
Acquiring a target difference item in a recognition result of a first single sentence and a recognition result of a second single sentence, wherein the target difference item comprises a first difference text and a second difference text, the first single sentence comprises a public item and the first difference text, and the second single sentence comprises the public item and the second difference text; the target sentences corresponding to the target difference items comprise the public items and the shielding items;
inputting the target sentence into a language prediction model to predict the shielding item, and outputting a candidate word list corresponding to the shielding item, wherein the candidate word list comprises a plurality of prediction results and probabilities corresponding to the plurality of prediction results;
Acquiring a first probability corresponding to the first difference text and a second probability corresponding to the second difference text from the candidate word list;
And judging whether the target difference item is a real difference item according to the first probability and the second probability.
2. The method of claim 1, wherein when the length of the first difference text is different from the length of the second difference text and the first difference text includes a semantically irrelevant word, the method further comprises:
removing the semantically irrelevant words in the first difference text so that the length of the first difference text is the same as the length of the second difference text.
3. The method of claim 2, wherein determining whether the target discrepancy item is a real discrepancy item based on the first probability and the second probability comprises:
and judging that the target difference item is a real difference item under the condition that the first probability is larger than or equal to a first threshold value and the second probability is larger than or equal to the first threshold value.
4. A method according to claim 3, wherein in case the first probability is smaller than the first threshold or the second probability is smaller than the first threshold, the method further comprises:
Acquiring first data corresponding to the first difference text and second data corresponding to the second difference text; the first data comprises first information corresponding to the first single sentence and a first image of the first single sentence, the second data comprises second information corresponding to the second single sentence and a second image of the second single sentence, the first information comprises a word vector of each word in the first single sentence, a position code of each word in the first single sentence and position information of each word in the first single sentence, and the second information comprises a word vector of each word in the second single sentence, a position code of each word in the second single sentence and position information of each word in the second single sentence;
And judging whether the target difference item is a real difference item or not based on the first data and the second data by using a difference item judging model.
5. The method of claim 4, wherein the difference term discrimination model includes a first feature extraction network, a second feature extraction network, a linear processing unit, and a classifier, the first feature extraction network being identical to the second feature extraction network, the determining whether the target difference term is a true difference term based on the first data and the second data using the difference term discrimination model comprising:
inputting the first information and the first image to the first feature extraction network to obtain a first feature vector;
inputting the second information and the second image to the second feature extraction network to obtain a second feature vector;
Inputting the first characteristic vector and the second characteristic vector to the linear processing unit to obtain a third characteristic vector;
And inputting the third feature vector to the classifier to obtain a classification result, wherein the classification result indicates whether the target difference item is a real difference item.
6. The method of claim 5, wherein:
The first feature extraction network comprises a first text encoding end, a first image encoding end and a first deep learning model, wherein the first text encoding end is used for outputting a first fusion feature according to the first information, the first image encoding end is used for outputting a first image feature according to the first image, and the first deep learning model is used for outputting the first feature vector according to the first fusion feature and the first image feature;
the second feature extraction network comprises a second text encoding end, a second image encoding end and a second deep learning model, wherein the second text encoding end is used for outputting a second fusion feature according to the second information, the second image encoding end is used for outputting a second image feature according to the second image, and the second deep learning model is used for outputting a second feature vector according to the second fusion feature and the second image feature.
7. The method of claim 6, wherein:
The first information further comprises word vectors, position codes and position information of N candidate words corresponding to the first difference text, and the first text coding end enables the correlation degree between the N candidate words corresponding to the first difference text and words except the first difference text in the first single sentence to be a preset value; n candidate words corresponding to the first difference text are determined from the candidate word list according to a first preset condition;
The second information further comprises word vectors, position codes and position information of N candidate words corresponding to the second difference text, and the second text coding end enables the correlation degree between the N candidate words corresponding to the second difference text and words except the second difference text in the second single sentence to be the preset value; n candidate words corresponding to the second difference text are determined from the candidate word list according to a second preset condition.
8. A method according to claim 3, wherein in case the first probability is smaller than the first threshold value and the second probability is larger than or equal to the first threshold value, the method further comprises:
Determining candidate words corresponding to the first difference text from the candidate word list according to a third preset condition;
When the number of the candidate words corresponding to the first difference text is 1, taking the candidate words corresponding to the first difference text as corrected first difference text;
Determining that the target difference term is not a true difference term when the second difference text is the same as the corrected first difference text;
And when the second difference text is different from the corrected first difference text, determining that the target difference item is a real difference item.
9. The method of claim 8, wherein the third preset condition is that a score of a candidate word in the candidate word list is greater than a second threshold, the score of the candidate word being determined by the probability corresponding to the candidate word and the font similarity between the candidate word and the first difference text.
10. A method according to claim 3, wherein in case the first probability is smaller than the first threshold value and the second probability is smaller than the first threshold value, the method further comprises:
Determining candidate words corresponding to the first difference text from the candidate word list according to a third preset condition;
determining candidate words corresponding to the second difference text from the candidate word list according to a fourth preset condition;
when the number of candidate words corresponding to the first difference text is 1 and the number of candidate words corresponding to the second difference text is 1, using the candidate words corresponding to the first difference text as corrected first difference text and using the candidate words corresponding to the second difference text as corrected second difference text;
determining that the target difference term is not a true difference term when the corrected first difference text is the same as the corrected second difference text;
And when the corrected first difference text is different from the corrected second difference text, determining that the target difference item is a real difference item.
11. The method of claim 10, wherein the third preset condition is that a score of a candidate word in the candidate word list is greater than a second threshold, the score of the candidate word being determined by a probability that the candidate word corresponds to and a font similarity between the candidate word and the first difference text; and the fourth preset condition is that the score of the candidate word in the candidate word list is larger than the second threshold value, and the score of the candidate word is determined by the probability corresponding to the candidate word and the font similarity between the candidate word and the second difference text.
12. The method of any of claims 1-6, wherein the first sentence and the second sentence are from two different documents that are associated.
13. An apparatus, the apparatus comprising:
The device comprises an acquisition unit, a recognition unit and a recognition unit, wherein the acquisition unit is used for acquiring target difference items in a recognition result of a first single sentence and a recognition result of a second single sentence, the target difference items comprise a first difference text and a second difference text, the first single sentence comprises a public item and the first difference text, and the second single sentence comprises the public item and the second difference text; the target sentences corresponding to the target difference items comprise the public items and the shielding items;
The prediction unit is used for inputting the target sentence into a language prediction model to predict the shielding item, and outputting a candidate word list corresponding to the shielding item, wherein the candidate word list comprises a plurality of prediction results and probabilities corresponding to the plurality of prediction results; acquiring a first probability corresponding to the first difference text and a second probability corresponding to the second difference text from the candidate word list;
And the judging unit is used for judging whether the target difference item is a real difference item according to the first probability and the second probability.
14. A computer readable storage medium, characterized in that the computer readable storage medium stores program instructions for implementing the method of any one of claims 1-12.
15. A computing device comprising a memory and a processor, the memory for storing program instructions; the computing device performs the method of any of claims 1-12 when the processor executes program instructions in the memory.
CN202011496118.XA 2020-12-17 2020-12-17 Method and device for discriminating difference term Active CN112528894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011496118.XA CN112528894B (en) 2020-12-17 2020-12-17 Method and device for discriminating difference term


Publications (2)

Publication Number Publication Date
CN112528894A CN112528894A (en) 2021-03-19
CN112528894B true CN112528894B (en) 2024-05-31

Family

ID=75001036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011496118.XA Active CN112528894B (en) 2020-12-17 2020-12-17 Method and device for discriminating difference term

Country Status (1)

Country Link
CN (1) CN112528894B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177402B (en) * 2021-04-26 2024-03-01 平安科技(深圳)有限公司 Word replacement method, device, electronic equipment and storage medium
CN113051869B (en) * 2021-05-24 2023-08-08 浙江有数数智科技有限公司 Method and system for realizing identification of text difference content by combining semantic recognition
CN113420546A (en) * 2021-06-24 2021-09-21 平安国际智慧城市科技股份有限公司 Text error correction method and device, electronic equipment and readable storage medium
CN117113978A (en) * 2021-06-24 2023-11-24 湖北大学 Text error correction system for debugging by using shielding language model
CN113821673B (en) * 2021-10-09 2023-05-05 成都统信软件技术有限公司 Picture processing method, computing device and readable storage medium
CN114792574B (en) * 2022-06-23 2022-09-06 普瑞基准生物医药(苏州)有限公司 Method for predicting hepatotoxicity caused by drug interaction based on graph neural network model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH103478A (en) * 1996-06-14 1998-01-06 Nippon Telegr & Teleph Corp <Ntt> Concept similarity discrimination method
CN102999483A (en) * 2011-09-16 2013-03-27 北京百度网讯科技有限公司 Method and device for correcting text
US9009025B1 (en) * 2011-12-27 2015-04-14 Amazon Technologies, Inc. Context-based utterance recognition
CN110347799A (en) * 2019-07-12 2019-10-18 腾讯科技(深圳)有限公司 Language model training method, device and computer equipment
CN110688471A (en) * 2019-09-30 2020-01-14 支付宝(杭州)信息技术有限公司 Training sample obtaining method, device and equipment
CN110750977A (en) * 2019-10-23 2020-02-04 支付宝(杭州)信息技术有限公司 Text similarity calculation method and system
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
WO2020073700A1 (en) * 2018-10-08 2020-04-16 腾讯科技(深圳)有限公司 Image description model training method and device, and storage medium
CN111325660A (en) * 2020-02-20 2020-06-23 中国地质大学(武汉) Remote sensing image style conversion method based on text data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Unified Framework for Multioriented Text Detection and Recognition; Cong Yao et al.; IEEE Transactions on Image Processing; Vol. 23, No. 11, pp. 4737-4749 *
Research on Generation Methods for Contrastive-Relation Sentences; Jiao Qihang et al.; Data Analysis and Knowledge Discovery; Vol. 4, No. 6, pp. 43-50 *

Also Published As

Publication number Publication date
CN112528894A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112528894B (en) Method and device for discriminating difference term
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Zhao et al. Generating natural adversarial examples
US10956673B1 (en) Method and system for identifying citations within regulatory content
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
US7327883B2 (en) Character recognition system and method
CN103984943B (en) A kind of scene text recognition methods based on Bayesian probability frame
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
Wang et al. Deep knowledge training and heterogeneous CNN for handwritten Chinese text recognition
Gao et al. A deep learning-based formula detection method for PDF documents
CN114372470B (en) Chinese law text entity identification method based on boundary detection and prompt learning
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN114691864A (en) Text classification model training method and device and text classification method and device
CN108536838A (en) Very big unrelated multivariate logistic regression model based on Spark is to text sentiment classification method
CN114662586A (en) Method for detecting false information based on common attention multi-mode fusion mechanism
Zia et al. Recognition of printed Urdu script in Nastaleeq font by using CNN-BiGRU-GRU based encoder-decoder framework
CN108595568A (en) A kind of text sentiment classification method based on very big unrelated multivariate logistic regression
Hossain et al. A novel approach to classify bangla sign digits using capsule network
CN116522942A (en) Chinese nested named entity recognition method based on character pairs
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
Vijayaraju Image retrieval using image captioning
Lomte et al. Handwritten Vedic Sanskrit Text Recognition Using Deep Learning
Arroyo et al. Multi-label classification of promotions in digital leaflets using textual and visual information
CN114298041A (en) Network security named entity identification method and identification device
Ashlin et al. Document Text Analysis and Recognition of Handwritten Telugu Scripts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant