CN111079432A - Text detection method and device, electronic equipment and storage medium - Google Patents

Text detection method and device, electronic equipment and storage medium

Info

Publication number
CN111079432A
CN111079432A (application CN201911088731.5A)
Authority
CN
China
Prior art keywords
text
detected
target
identification
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911088731.5A
Other languages
Chinese (zh)
Other versions
CN111079432B (en)
Inventor
陈利琴
刘设伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd, Taikang Online Property Insurance Co Ltd filed Critical Taikang Insurance Group Co Ltd
Priority to CN201911088731.5A priority Critical patent/CN111079432B/en
Publication of CN111079432A publication Critical patent/CN111079432A/en
Application granted granted Critical
Publication of CN111079432B publication Critical patent/CN111079432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a text detection method, a text detection device, electronic equipment and a computer readable storage medium, belonging to the technical field of computers, wherein the text detection method comprises the following steps: analyzing a text to be detected to obtain a target text to be detected, and marking an identification text existing in the target text to be detected; classifying the target text to be detected through a trained classification model to determine a classification result; if the classification result belongs to a preset type, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to the comparison result; and if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text. The embodiment of the invention can improve the accuracy of text detection.

Description

Text detection method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a text detection method, a text detection device, electronic equipment and a computer readable storage medium.
Background
Documents generally use bold formatting and similar emphasis to remind readers of, or highlight, important content, so recognizing bold text is an important task.
In the related art, bold text is generally detected with a text classification method or a conventional rule matching method. Text classification can classify a sentence, a paragraph, or a document as a whole, but cannot directly classify a word or a portion of text within a sentence. Therefore, when only part of a sentence is bold, the text classification method may additionally require manual checking of whether the text is correct, so the recognition rate and recognition efficiency are low, and whether the bold text is compliant cannot be accurately detected. The rule matching method does not consider the semantic information of the text, so its accuracy in recognizing bold text is low.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
Embodiments of the present invention provide a text detection method, a text detection apparatus, an electronic device, and a computer-readable storage medium, so as to overcome the problems of low accuracy and low recognition efficiency in recognizing bold text to at least a certain extent.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the invention.
According to an aspect of an embodiment of the present invention, there is provided a text detection method, including: analyzing a text to be detected to obtain a target text to be detected, and marking an identification text existing in the target text to be detected; classifying the target text to be detected through a trained classification model to determine a classification result; if the classification result belongs to a preset type, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to the comparison result; and if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
In an exemplary embodiment of the present invention, before classifying the target text to be detected through the trained classification model, the method further includes: acquiring first sample data, and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence; training the word vector sequence through a long short-term memory network to obtain context features; applying a max pooling operation to the context features to obtain a feature vector of the word vector sequence; and inputting the feature vector into a linear layer and a classification layer in sequence to obtain the trained classification model.
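The pooling and classification steps above can be sketched in pure Python as a toy illustration; a real model operates on learned tensors, and all names and values here are hypothetical:

```python
import math

def max_pool_over_time(hidden_states):
    """Collapse a sequence of hidden vectors into one feature vector by
    taking the maximum of each dimension across all time steps."""
    dims = len(hidden_states[0])
    return [max(h[d] for h in hidden_states) for d in range(dims)]

def softmax(logits):
    """Classification layer: turn linear-layer scores into probabilities
    (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three time steps of 2-dimensional context features (arbitrary values):
features = max_pool_over_time([[0.1, 0.9], [0.4, 0.2], [0.3, 0.8]])  # -> [0.4, 0.9]
probs = softmax([2.0, 1.0, 0.1])  # probabilities over the three classes
```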
In an exemplary embodiment of the present invention, acquiring first sample data, and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence includes: preprocessing a historical detection text, and determining the preprocessed historical detection text as a positive sample and a negative sample to obtain first sample data; obtaining word vectors according to the preprocessed historical detection texts; serializing the preprocessed historical detection texts to obtain serial historical detection texts; and constructing the embedding layer according to the word vectors and the sequence history detection text, and inputting the first sample data into the embedding layer to generate the word vector sequence.
In an exemplary embodiment of the present invention, before performing sequence tagging on the target text to be detected through a trained named entity recognition model to determine a text entity of the target text to be detected, the method further includes: acquiring second sample data, wherein the second sample data is obtained according to a sequence tagging rule; inputting the second sample data into a long short-term memory network to obtain the probability that each character in the second sample data is tagged with each sequence label; and inputting these probabilities, together with the transition probabilities between labels, into a conditional random field layer to perform sentence-level sequence tagging, so as to obtain the trained named entity recognition model.
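The source only states that the LSTM's per-character label probabilities and the CRF's label transition probabilities are combined for sentence-level tagging; a common way to decode such scores into a label sequence is Viterbi search. The sketch below is an illustrative assumption (log-domain scores, toy labels), not the patent's implementation:

```python
def viterbi_decode(emissions, transitions, labels):
    """Pick the best label sequence by combining per-character emission
    scores (from the LSTM) with label transition scores (from the CRF).
    Scores are log-probabilities, so they add along a path."""
    n = len(emissions)
    best = [{lab: emissions[0][lab] for lab in labels}]  # scores at step 0
    back = []                                            # backpointers
    for t in range(1, n):
        scores, ptr = {}, {}
        for lab in labels:
            prev, s = max(
                ((p, best[t - 1][p] + transitions[(p, lab)]) for p in labels),
                key=lambda x: x[1],
            )
            scores[lab] = s + emissions[t][lab]
            ptr[lab] = prev
        best.append(scores)
        back.append(ptr)
    last = max(labels, key=lambda lab: best[-1][lab])
    path = [last]
    for ptr in reversed(back):       # walk backpointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]
```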
In an exemplary embodiment of the present invention, marking the identification text existing in the target text to be detected includes: if the identification text is all texts in the target text to be detected, adding a first mark to the identification text, and determining the mark position of the identification text; and if the identification text is a part of text in the target text to be detected, adding a second mark to the identification text, and determining the mark position of the identification text.
In an exemplary embodiment of the present invention, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to a comparison result includes: acquiring an identification text in the target text to be detected according to the marking position; if the target text in the target text to be detected is consistent with the identification text, determining the target text compliance; and if the target text is not consistent with the identification text, determining that the target text is not compliant.
In an exemplary embodiment of the present invention, performing compliance detection on the target text to be detected according to a comparison result between the text entity and the identification text includes: acquiring a text entity according to the marked position of the text entity; if the text entity is consistent with the identification text, determining that the text entity is in compliance; and if the text entity is not consistent with the identification text, determining that the text entity is not in compliance.
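As a hedged illustration of this position-based comparison (the function name and return convention are assumptions, not from the source):

```python
def check_compliance(sentence, marked_position, identification_text):
    """Extract the text at the marked (start, end) position, inclusive,
    and compare it with the identification text; consistent means
    compliant."""
    start, end = marked_position
    if start is None:                 # (None, None): no identification span
        return False
    return sentence[start:end + 1] == identification_text

# Characters 1..3 of the sentence must match the identification text:
ok = check_compliance("abcdef", (1, 3), "bcd")
```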
According to an aspect of an embodiment of the present invention, there is provided a text detection apparatus including: the identification text determining module is used for analyzing the text to be detected to obtain a target text to be detected and marking the identification text existing in the target text to be detected; the text classification module is used for classifying the target text to be detected through the trained classification model so as to determine a classification result; the first detection module is used for comparing the target text to be detected with the identification text if the classification result belongs to a preset type, and performing compliance detection on the target text to be detected according to the comparison result; and the second detection module is used for inputting the target text to be detected into a trained named entity recognition model to determine a text entity if the classification result does not belong to the preset type, and performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
According to an aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a text detection method as described in any one of the above.
According to an aspect of an embodiment of the present invention, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the text detection methods described above via execution of the executable instructions.
In the text detection method, the text detection device, the electronic equipment and the computer-readable storage medium provided by the embodiment of the invention, the trained classification model classifies the target text to be detected to determine whether the classification result belongs to a preset type; if the classification result belongs to the preset type, the target text to be detected is compared with the identification text for compliance detection; and if the classification result does not belong to the preset type, sequence tagging is performed on the target text to be detected through the trained named entity recognition model to determine its text entity, and compliance detection is performed according to the comparison result between the text entity and the identification text. On one hand, the trained classification model can determine whether the target text to be detected belongs to the preset type; when it does, the target text to be detected is compared with the originally marked identification text for compliance detection, and when it does not, the text entity obtained from the trained named entity recognition model is compared with the originally marked identification text for compliance detection. On the other hand, the trained classification model and named entity recognition model can recognize text automatically and perform compliance detection on the target text to be detected, reducing manual review work, improving efficiency, and saving cost.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a text detection method according to an embodiment of the invention;
FIG. 2 schematically illustrates a diagram of training a classification model of an embodiment of the invention;
FIG. 3 schematically illustrates a schematic diagram of generating a sequence of word vectors according to an embodiment of the invention;
FIG. 4 schematically shows a flowchart for comparing a target text to be detected with an identification text according to an embodiment of the present invention;
FIG. 5 schematically illustrates a diagram of training a named entity recognition model according to an embodiment of the invention;
FIG. 6 is a schematic diagram that schematically illustrates comparing text entities with tagged locations that identify text, in accordance with an embodiment of the present invention;
FIG. 7 schematically illustrates an overall flow diagram of text compliance detection in accordance with an embodiment of the present invention;
FIG. 8 schematically shows a block diagram of a text detection apparatus according to an embodiment of the present invention;
FIG. 9 schematically shows a block diagram of an electronic device for implementing the text detection method described above.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In order to solve the above problem, an embodiment of the present invention first provides a text detection method, which may be applied in a processing scenario for performing compliance detection on texts in various documents. The main body of the text detection method may be a server, and as shown in fig. 1, the text detection method may include step S110, step S120, step S130, and step S140.
Wherein:
in step S110, analyzing the text to be detected to obtain a target text to be detected, and marking an identification text existing in the target text to be detected;
in step S120, classifying the target text to be detected through the trained classification model to determine a classification result;
in step S130, if the classification result belongs to a preset type, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to the comparison result;
in step S140, if the classification result does not belong to the preset type, inputting the target text to be detected into the trained named entity recognition model to determine a text entity, and performing compliance detection on the target text to be detected according to a comparison result between the text entity and the identification text.
In the technical solution provided in the exemplary embodiment of the present invention, on one hand, it may be determined whether the target text to be detected belongs to the preset type through the trained classification model, and when the target text to be detected belongs to the preset type, compliance detection may be performed by comparing the recognized target text to be detected with the originally labeled identification text, or when the target text to be detected does not belong to the preset type, compliance detection may be performed by comparing the text entity of the target text to be detected obtained according to the trained named entity recognition model with the originally labeled identification text. On the other hand, the trained classification model and the named entity recognition model can realize the function of automatically recognizing the text, perform compliance detection on the target text to be detected, reduce the operation of manual examination and check, improve the efficiency and save the cost.
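The dispatch logic of steps S120–S140 can be sketched as follows; the two trained models are replaced by stand-in callables, and every name and the class label "A_Bold" (borrowed from the labeling scheme described later) are hypothetical:

```python
def detect_compliance(target_text, identification_text, classify, extract_entity):
    """Dispatch compliance detection per steps S120-S140.

    classify and extract_entity stand in for the trained classification
    model and the trained named entity recognition model."""
    result = classify(target_text)            # step S120: classify the text
    if result == "A_Bold":                    # preset type: whole sentence bold
        return target_text == identification_text   # step S130: direct comparison
    entity = extract_entity(target_text)      # step S140: NER-based extraction
    return entity == identification_text

# Usage with trivial stand-ins for the models:
is_ok = detect_compliance(
    "the insurer shall pay",
    "the insurer shall pay",
    classify=lambda t: "A_Bold",
    extract_entity=lambda t: t,
)
```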
Next, the text detection method in the embodiment of the present invention will be further explained with reference to the drawings.
In step S110, the text to be detected is analyzed to obtain a target text to be detected, and the identification text existing in the target text to be detected is marked.
In the embodiment of the present invention, the text to be detected may be an online text or an offline stored local text, and specifically may be any whole or partial document expressed in a text form, for example, a contract, a written document, a teaching document, and the like, which is not limited herein. The text to be detected can be in the form of a word document. The target text to be detected refers to the analyzed text to be detected. The identification text may be a text in the target text to be detected, which is different from other display forms, for example, a bold text, an italic text, an enlarged text, or other specially displayed texts in the target text to be detected.
The target text to be detected refers to the document after analysis processing, which mainly consists of sentence division and preprocessing operations on the text to be detected. The whole document can be divided quickly and accurately by a program. Further, after sentence division, the text may be preprocessed. The preprocessing operations here include, but are not limited to, removing special characters, converting uppercase English letters to lowercase, and so on. Preprocessing makes the resulting target text to be detected more standardized and easier to process. Based on this, in subsequent processing, the target text to be detected refers to the text contained in each sentence.
After the target text to be detected is obtained, the identification text existing in it can be marked to obtain the marked position of the identification text, so that the location of the identification text can be accurately determined. Specifically, marking the identification text covers two cases. Case one: if the identification text is the entirety of the target text to be detected, a first mark is added to the identification text and its marked position is determined. That is, if the whole sentence is identification text, a first mark may be added, which may be, for example, "1" or another number or letter. For sentences carrying the first mark, the classification model alone may not be able to determine whether they are identification text, whereas the named entity recognition model may recognize the whole sentence as one identification text; therefore a sentence with the first mark is processed as follows: the start and end positions of the identification text are recorded, and its marked position is determined as (0, length - 1), where length is the length of the sentence.
Case two: if the identification text is only part of the target text to be detected, a second mark is added to the identification text and its marked position is determined. That is, if only part of the sentence is identification text, a second mark may be added, which may be, for example, "0" or another number or letter different from the first mark. For a sentence carrying the second mark, the start and end positions of the identification text (the partially bold text) are recorded, and the marked position is (start_index, end_index); for example, if the first six characters of the sentence "When the applicant requests compensation, the following proofs and materials shall be provided to the insurer." are bold, the marked position is recorded as (0, 5). In addition, for the text in the sentence other than the identification text, the start and end positions are recorded as (None, None).
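The two marking cases can be illustrated with a minimal sketch; the flag values "1" and "0" and the position conventions (0, length - 1) and (start_index, end_index) follow the text, while the function shape is an assumption:

```python
def mark_identification_text(sentence, start, end):
    """Return (flag, position) for an identification (bold) span.

    If the whole sentence is identification text, add the first mark
    ("1") and record the position as (0, length - 1); otherwise add the
    second mark ("0") and record (start_index, end_index)."""
    if start == 0 and end == len(sentence) - 1:
        return "1", (0, len(sentence) - 1)    # case one: entire sentence bold
    return "0", (start, end)                  # case two: partial bold span

# A partially bold sentence whose first five characters are bold:
flag, pos = mark_identification_text("insured shall provide proof", 0, 4)
```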
In the embodiment of the invention, the position of the identification text can be accurately determined by marking the identification text, so that the target text to be detected can be accurately subjected to compliance detection based on the marked position.
In step S120, the target text to be detected is classified by the trained classification model to determine a classification result.
In the embodiment of the invention, the classification model is used for classifying the target text to be detected to obtain a classification result. The classification results may generally be all bold type, some bold type, and all non-bold type. Specifically, the text to be detected of the target can be input into the trained classification model, and a classification result of the text to be detected belonging to a certain type can be obtained. In order to obtain more accurate classification results, a trained classification model may be obtained before step S120.
The classification model training flow chart is schematically shown in fig. 2, and referring to fig. 2, the process of obtaining the trained classification model may include the following steps:
step S210, acquiring first sample data, and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence.
In the embodiment of the present invention, the first sample data refers to a portion of text, taken from the history detection texts, that is used to train the classification model; that is, the first sample data is text for which it is already known whether it is bold. A history detection text is a previously detected document for which the bold text is already known.
After the first sample data is obtained, the first sample data may be input into an embedding layer to generate a corresponding word vector sequence, as shown in fig. 3, specifically including the following steps S310 to S340, where:
step S310, preprocessing the history detection text, and determining the preprocessed history detection text as a positive sample and a negative sample to obtain the first sample data.
In the embodiment of the invention, before the first sample data is acquired, the history detection texts can be processed. Specifically, the history detection texts may be divided into sentences and then formatted, and the formatted texts split character by character to obtain the preprocessed history detection texts. Concretely, all documents are divided into sentences to serve as the corpus for training the model; the corpus is loaded and formatted, where formatting includes removing special characters, converting traditional Chinese characters to simplified ones, converting uppercase English letters to lowercase, and so on; and all the formatted corpus sentences are split by character to obtain the preprocessed history detection texts. In this way, each preprocessed history detection text is a sentence split into characters.
After the preprocessed historical detection texts are obtained, first sample data can be obtained according to the texts. The first sample data may specifically comprise positive and negative samples. Specifically, sentences of the bold text including all the history detection texts may be labeled as positive samples, and other sentences may be labeled as negative samples.
For example, all terms in an insurance contract are extracted, and all insurance clauses are divided into sentences to serve as the initial corpus for training the model. Each sentence is assigned to one of three classes according to whether the whole sentence is bold, part of the sentence is bold, or none of the sentence is bold, labeled A_Bold, P_Bold, and N_Bold respectively. The classified history detection texts are then formatted, including removing special characters, converting traditional Chinese characters to simplified ones, converting uppercase English letters to lowercase, and so on. All the formatted corpus sentences are split by character to obtain the preprocessed history detection texts. Next, the sentences in the preprocessed history detection texts that are entirely bold (A_Bold) may be labeled as positive samples, and the other sentences (P_Bold, N_Bold) as negative samples, so as to obtain the first sample data.
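A minimal sketch of the preprocessing and labeling described above; the traditional-to-simplified conversion is omitted, and the regular expression is an assumption about which characters count as "special":

```python
import re

def preprocess(sentence):
    """Format a corpus sentence: strip special characters (keeping word
    characters and CJK ideographs), lowercase English, then split by
    character."""
    cleaned = re.sub(r"[^\w\u4e00-\u9fff]", "", sentence).lower()
    return list(cleaned)

def label_sample(category):
    """A_Bold (whole sentence bold) is a positive sample; P_Bold and
    N_Bold are negative samples."""
    return 1 if category == "A_Bold" else 0

chars = preprocess("The Insurer SHALL pay!")
labels = [label_sample(c) for c in ("A_Bold", "P_Bold", "N_Bold")]  # -> [1, 0, 0]
```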
And step S320, obtaining a word vector according to the preprocessed historical detection text.
In the embodiment of the invention, the word vectors are trained on the preprocessed history detection texts through the word2vec word embedding algorithm. After the training corpus has been preprocessed and labeled, word vectors need to be trained in advance: the character-split, preprocessed history detection texts are used to pre-train word vectors via word2vec, and these vectors are then used in training the classification model, training the named entity recognition model, and at the application stage. The network takes a word as input and predicts the words at adjacent positions; under the bag-of-words assumption in word2vec, the order of the words is unimportant. After training is completed, the word2vec model can map each word to a vector that captures word-to-word relationships; the vector corresponds to a hidden layer of the neural network.
Step S330, serializing the preprocessed history detection texts to obtain serialized history detection texts.
In the embodiment of the present invention, serialization refers to the process of converting the state information of an object into a form that can be stored or transmitted. Here it means that each character obtained by the character split is represented by a number, yielding the sequence history detection text.
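The character-to-number serialization can be sketched as a simple vocabulary lookup; this is a minimal hypothetical sketch, and the reserved ids for padding and unknown characters are assumptions rather than part of the described method:

```python
def build_vocab(corpus):
    # Assign each character a unique integer id; ids 0 and 1 are
    # reserved here for padding and out-of-vocabulary characters.
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in corpus:
        for ch in sentence:
            vocab.setdefault(ch, len(vocab))
    return vocab

def serialize(sentence, vocab):
    # Represent each character of the sentence by its number.
    return [vocab.get(ch, vocab["<UNK>"]) for ch in sentence]
```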
Step S340, constructing an embedding layer according to the word vectors and the sequence history detection texts, and inputting the first sample data into the embedding layer to generate the word vector sequence.
In the embodiment of the invention, after the word vectors and the sequence history detection text are obtained, the embedding layer of the classification model can be constructed from them. The classification model is a neural network model comprising an embedding layer, a bidirectional long short-term memory (Bi-LSTM) layer, a max-pooling layer, a linear layer and a softmax classification layer.
The linear embedding layer maps each input word into a distributed word vector through a shared matrix, that is, the embedding layer represents words as vectors. After the embedding layer is constructed, the first sample data composed of the labeled positive and negative samples can be input into it to generate the word vector sequences corresponding to those samples.
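The embedding lookup that turns a serialized sentence into a word vector sequence can be sketched with NumPy; the pre-trained matrix below stands in for the word2vec vectors, and the shapes and names are illustrative assumptions:

```python
import numpy as np

def embed(id_seq, embedding_matrix):
    # embedding_matrix: (vocab_size, dim) array of pre-trained vectors;
    # each row is the distributed vector of one character id.
    return embedding_matrix[np.array(id_seq)]
```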
Continuing to refer to fig. 2, in step S220, the word vector sequence is trained through a long short-term memory network to obtain context features.
In the embodiment of the invention, the long short-term memory model is a bidirectional one, composed of a forward LSTM and a backward LSTM. The context features of a word vector sequence (sentence) are extracted with the bidirectional long short-term memory model; this is an encoding process, and the specific procedure is as follows:
For the word vector sequence $(x_1, x_2, \ldots, x_n)$ of the sentence obtained in step S210, LSTM encoding is performed from left to right and from right to left, yielding the hidden-layer state of each time step in both directions. The forward hidden-layer output is denoted $\overrightarrow{h_t}$ and the backward hidden-layer output $\overleftarrow{h_t}$.
The calculation formulas of the LSTM unit include formula (1) to formula (5):
$$i_t=\sigma(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)\qquad\text{formula (1)}$$
$$f_t=\sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f)\qquad\text{formula (2)}$$
$$c_t=f_t c_{t-1}+i_t\tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c)\qquad\text{formula (3)}$$
$$o_t=\sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_t+b_o)\qquad\text{formula (4)}$$
$$h_t=o_t\tanh(c_t)\qquad\text{formula (5)}$$
Where $\sigma$ is the logistic (sigmoid) activation function and $x_t$ is the word vector at time t; $i_t$, $f_t$ and $o_t$ denote the input gate, forget gate and output gate at time t; $c_t$ and $c_{t-1}$ denote the memory cell state at times t and t-1; $h_t$ denotes the hidden-layer vector at time t. $b_i$, $b_f$, $b_c$ and $b_o$ are the bias terms of the input gate, forget gate, memory cell and output gate respectively. The subscripts of the weight matrices $W$ indicate which components they connect, e.g. $W_{hi}$ is the weight matrix connecting the hidden layer to the input gate.
In order to fully utilize the context information at each moment of the text, the forward and backward hidden-layer outputs are concatenated as the hidden-layer output at that moment, expressed as:
$$h_t=[\overrightarrow{h_t};\overleftarrow{h_t}]$$
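Formulas (1) to (5) can be sketched directly in NumPy; the parameter dictionary p (with keys such as "Wxi" and "bi") is an illustrative convention, not part of the described method:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM time step following formulas (1)-(5).
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ c + p["bo"])
    h = o * np.tanh(c)
    return h, c
```

A bidirectional layer runs this step once left-to-right and once right-to-left over the sequence, then concatenates the two hidden states at each time step.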
in step S230, a maximal pooling operation is performed on the context features to obtain feature vectors of the word vector sequence.
In the embodiment of the present invention, maximum pooling (max-pooling) takes the point with the maximum value in the local receptive field. The max-pooling operation is applied to the output of the Bi-LSTM layer to obtain a feature representation of the input word vector sequence; it extracts the most useful features of the word vector sequence, i.e., the feature vector.
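Applied over the Bi-LSTM outputs, this max-pooling reduces to an element-wise maximum over time steps, as in the following minimal NumPy sketch (shapes are illustrative):

```python
import numpy as np

def max_pool(hidden_states):
    # hidden_states: (seq_len, 2 * hidden_dim) Bi-LSTM outputs;
    # keep the maximum of each feature dimension across all time steps.
    return hidden_states.max(axis=0)
```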
In step S240, the feature vectors are sequentially input into a linear layer and a classification layer to obtain the trained classification model.
In the embodiment of the invention, the feature vector corresponding to the word vector sequence is first input into the linear layer and then into the classification layer, and the weight parameters of each layer of the neural network are adjusted until the classification result on the first sample data is consistent with the manually determined classification, at which point the trained classification model is obtained.
Continuing to refer to fig. 1, in step S130, if the classification result belongs to a preset type, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to the comparison result.
In the embodiment of the present invention, on the basis of step S120, it may be determined whether the obtained classification result belongs to a preset type, and if the classification result is the preset type, the classification result may be marked with "1". The preset type may specifically represent a bold type, for example, all texts in the text to be detected of the target all belong to bold texts.
If the classification result of the target text to be detected belongs to the preset type, the target text to be detected can be compared with the identification text marked in step S110 so as to perform compliance detection. Compliance detection here means judging whether the target text to be detected is bold text and whether the bold text in it is correct. Fig. 4 schematically shows a flowchart for comparing the target text to be detected with the identification text; referring to fig. 4, the method mainly includes steps S410 to S430, where:
step S410, obtaining the identification text according to the mark position, and judging whether the target text in the target text to be detected is consistent with the identification text.
In this step, the target text refers to the bold text recognized from the target text to be detected by the trained classification model, that is, the text marked as "1" in the classification result obtained from the classification model. The pre-marked identification text can be accurately acquired according to the mark position, for example according to the start- and end-position marks (0, length-1). The obtained target text is then compared with the predetermined identification text to see whether the two are the same.
Step S420, if the target text is consistent with the identification text, determining that the target text is compliant.
In this step, if the text marked as "1" in the target text to be detected is "insurance applicant", and the bold text between the pre-marked start- and end-position marks (0, 5) is also "insurance applicant", the two are considered the same, and it can be determined that the target text is compliant, which may be indicated with the label "compliant".
Step S430, if the target text is not consistent with the identification text, determining that the target text is not compliant.
In this step, if the text marked as "1" in the target text to be detected differs from the bold text between the pre-marked start- and end-position marks (0, 5), the two are considered different, and it can be determined that the target text is not compliant, which may be indicated with the label "not compliant".
It should be noted that the mark position of the target text may also be compared with the mark position of the identification text, checking whether the contents are the same where the mark positions coincide; this is not described in detail here.
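The comparison of steps S410 to S430, including the position check just noted, can be sketched as follows; the (start, end) span format is a hypothetical convention for the mark positions:

```python
def check_compliance(target_text, target_span, ident_text, ident_span):
    # Both the mark positions and the text content must match
    # for the target text to be judged compliant.
    if target_span == ident_span and target_text == ident_text:
        return "compliant"
    return "not compliant"
```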
Continuing to refer to fig. 1, in step S140, if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and performing compliance detection on the target text to be detected according to a comparison result between the text entity and the identification text.
In the embodiment of the invention, if the target text to be detected does not belong to the preset type, compliance detection of the target text to be detected continues. Specifically, the target text to be detected can be input into the trained named entity recognition model for sequence labeling to determine whether a text entity is obtained. If a text entity is identified, it indicates that part of the target text to be detected should be set to the preset type; for example, a recognized bold text entity indicates that part of the sentence needs to be bolded. Compliance detection is then performed on the target text to be detected according to the comparison result between the text entity and the identification text. By combining the classification model and the named entity recognition model, the bold text in the text to be detected can be accurately recognized, improving the accuracy of compliance detection.
The named entity recognition model is used for recognizing entities with specific categories, such as names of people, places, organizations, proper nouns and the like, from a target text to be detected. The problem of named entity recognition models is generally abstracted as the problem of sequence labeling, which is the problem of assigning a specific label to each symbol in a sequence, essentially classifying each element in the sequence according to the context.
In order to obtain a more accurate classification result, a trained named entity recognition model may be obtained before step S140, so that the text entity present in the target text to be detected can be determined through this model. The type of text entity depends on the recognition target: if the recognition target is bold text, the text entity is a bold text entity; if the recognition target is italic text, the text entity is an italic text entity. In the embodiment of the present invention, a bold text entity is taken as the example.
Fig. 5 schematically illustrates a schematic diagram of training a named entity recognition model, and referring to fig. 5, the method mainly includes steps S510 to S530, where:
step S510, second sample data is obtained according to the sequence labeling rule.
In this step, the second sample data is used for training the named entity recognition model, and the second sample data is obtained by a sequence labeling mode. Specifically, the partially-bolded sentences (partially-bolded texts) in the preprocessed historical detection texts are labeled according to the BIO labeling mode labeled by the named entity identification data. The labels are in the form of B-BOLD, I-BOLD, O, representing the beginning symbol of BOLD text, the non-beginning symbol of BOLD text, non-BOLD text symbols, such as a sentence: when the insurance applicant requests compensation, the following proof and data should be provided to the insurer. Labeled as: B-BOLD, I-BOLD, I-BOLD, I-BOLD, I-BOLD, I-BOLD, O, O, O, O, O, O, O, O, O, O, O, O, O, O.
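The BIO labeling rule above can be sketched as follows, assuming the bold span is given as a (start, end) character range with an exclusive end (an assumption about the annotation format):

```python
def bio_labels(chars, bold_span):
    # B-BOLD marks the first bold character, I-BOLD the remaining
    # bold characters, O every non-bold character.
    start, end = bold_span
    labels = []
    for i in range(len(chars)):
        if start <= i < end:
            labels.append("B-BOLD" if i == start else "I-BOLD")
        else:
            labels.append("O")
    return labels
```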
Step S520, inputting the second sample data into a long short-term memory network so as to obtain the probability that each character in the second sample data is labeled with each sequence label.
In this step, the named entity recognition model is composed of an embedding layer, a bidirectional long short-term memory (Bi-LSTM) layer and a conditional random field layer. The inputs and outputs of a conventional neural network are independent of one another, but in sequence labeling the current output is correlated with the preceding content, i.e. the output labels are strongly dependent; the same embedding layer and Bi-LSTM network structure as in the classification model are therefore adopted here. The second sample data obtained in step S510 may be input into the Bi-LSTM network structure, whose output value indicates the probability that each character in the second sample data is labeled with each sequence label (B-BOLD, I-BOLD, O).
The hidden-layer output $h_t=[\overrightarrow{h_t};\overleftarrow{h_t}]$ is obtained and passed directly to a linear layer for linear transformation, converting each hidden-layer state into a vector $p_t$ of dimension $1\times k$, where k is the number of labels.
If a softmax function were attached directly, each token would be sequence-labeled independently; however, sequence labeling cannot be treated as a simple classification problem, because the words influence one another, and treating it as plain classification would lose information. There are dependencies between labels (for example, I-BOLD cannot directly follow the beginning identifier of a different entity type), so a conditional random field layer is introduced to model the output of the entire sentence.
Step S530, inputting the transition probability between the probability and the label into a conditional random field layer to carry out sentence-level sequence labeling so as to obtain the trained named entity recognition model.
In this step, the transition probability between tags refers to the probability that one tag is followed by another, for example the probability that tag B-BOLD transitions to tag I-BOLD. A conditional random field layer is attached, and sentence-level sequence labeling is performed by fusing the output value of the Bi-LSTM network structure with the transition probabilities between labels; the specific process is as follows:
The score of a sentence X with sequence label l is defined as:
$$\mathrm{score}(X,l)=\sum_{i=1}^{n}\left(A_{l_{i-1},\,l_i}+f_{i,\,l_i}\right)$$
where A is the label-transition matrix of the conditional random field layer (a learned parameter) and f is the output value of the Bi-LSTM network structure. Normalizing this score over all candidate label sequences with the softmax layer gives the probability:
$$p(l\mid X)=\frac{\exp(\mathrm{score}(X,l))}{\sum_{l'}\exp(\mathrm{score}(X,l'))}$$
Therefore, in training the model, the optimization goal is to minimize the negative log-likelihood:
$$\mathcal{L}=-\log p(l\mid X)$$
When the model makes predictions, the optimal path is solved with the Viterbi algorithm of dynamic programming:
$$l^{*}=\arg\max_{l'}\ \mathrm{score}(X,l')$$
and the conditional random field layer carries out sentence-level sequence marking by combining the output of the Bi-LSTM network structure and the transition probability among the labels until the marking result is consistent with the sequence marking of the historical detection text manually determined in advance so as to obtain a trained named entity recognition model with better performance.
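The Viterbi decoding used at prediction time can be sketched in NumPy; here emissions stands for the Bi-LSTM output scores f and transitions for the learned matrix A (names and shapes are illustrative):

```python
import numpy as np

def viterbi(emissions, transitions):
    # emissions: (T, K) per-step label scores; transitions: (K, K)
    # score of moving from one label to the next.
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]  # best label index for each time step
```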
Fig. 6 schematically shows steps S610 to S630 of performing compliance detection on the target text to be detected according to the comparison result between the text entity and the identification text; the steps include:
step S610, acquiring the text entity according to the mark position of the text entity;
step S620, if the text entity is consistent with the identification text, the text entity is determined to be in compliance;
step S630, if the text entity is not consistent with the identification text, determining that the text entity is not compliant.
In the embodiment of the invention, bold text recognition is carried out through a trained named entity recognition model, and a label sequence is output; the tag sequence is further converted into a text entity, and a mark position of the text entity is obtained, which may include a start position and an end position. If there is no text entity, the start and end positions are marked as (None).
On the basis, the text entity can be quickly extracted according to the mark position of the text entity, the text entity is compared with the mark text obtained in the step S110, if the text entity is completely consistent with the mark text, the compliance of the text entity is determined, and a label of 'compliance' is returned. And if the text entity is not accordant with the identification text, determining that the text entity is not in compliance, and returning a label of 'non-compliance'.
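Converting the output tag sequence into a text entity span can be sketched as follows, assuming one contiguous bold entity per sentence, as in the examples above:

```python
def tags_to_span(tags):
    # Return the (start, end) span of the B-BOLD/I-BOLD run,
    # end exclusive, or (None, None) when no entity is present.
    start = end = None
    for i, tag in enumerate(tags):
        if tag == "B-BOLD":
            start, end = i, i + 1
        elif tag == "I-BOLD" and start is not None:
            end = i + 1
    return (start, end) if start is not None else (None, None)
```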
In the embodiment of the invention, the trained classification model and the trained named entity recognition model recognize the target text to be detected, and the result is compared with the identification text, so that the bold text in the target text to be detected can be accurately recognized and its compliance accurately determined. In addition, the classification model detects whether a whole sentence needs to be bold, while the named entity recognition method detects the partially bolded text within a sentence; combining named entity recognition with text classification and applying them to bold-text detection in text compliance checking greatly reduces the labor and time cost of text verification and improves operating efficiency.
An overall flow chart for compliance detection of a text is schematically shown in fig. 7, and referring to the flow chart shown in fig. 7, mainly includes the following steps:
and S701, analyzing the text to be detected to obtain the target text to be detected.
Step S702, inputting the target text to be detected into the trained classification model to judge whether the target text is a preset type, wherein the preset type can be all bold texts.
Step S703, if the text to be detected is determined to be a bold text, comparing the bold text with the pre-marked identification text to determine whether the text to be detected is compliant.
Step S704, if the text to be detected is judged to be a non-bold text, inputting the text to be detected into the trained named entity recognition model.
Step S705, determining a text entity through the trained named entity recognition model. The text entity is a bold text entity.
Step S706, comparing the text entity with the marked identification text in advance to determine whether the text entity is in compliance.
According to the technical scheme in fig. 7, whether the target text to be detected belongs to the preset type can be determined by the trained classification model. If it belongs to the preset type, the target text to be detected is compared with the originally marked identification text for compliance detection; if it does not, the text entity obtained through the trained named entity recognition model is compared with the originally marked identification text for compliance detection.
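The overall flow of steps S701 to S706 can be sketched with the two trained models abstracted as callables; all names and data formats here are illustrative assumptions:

```python
def detect(sentence, classify, recognize_entity, ident_text, ident_span):
    # classify(sentence) -> 1 when the whole sentence is bold;
    # recognize_entity(sentence) -> (start, end) span or (None, None).
    if classify(sentence) == 1:
        target, span = sentence, (0, len(sentence))
    else:
        span = recognize_entity(sentence)
        target = sentence[span[0]:span[1]] if span[0] is not None else None
    return "compliant" if (target == ident_text and span == ident_span) else "not compliant"
```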
In an embodiment of the present invention, there is further provided a text detection apparatus, as shown in fig. 8, the apparatus 800 mainly includes:
the identification text determining module 801 may be configured to analyze a text to be detected to obtain a target text to be detected, and mark an identification text existing in the target text to be detected;
the text classification module 802 may be configured to classify the target text to be detected through the trained classification model to determine a classification result;
the first detection module 803 may be configured to, if the classification result belongs to a preset type, compare the target text to be detected with the identification text, and perform compliance detection on the target text to be detected according to a comparison result;
the second detecting module 804 may be configured to, if the classification result does not belong to the preset type, input the target text to be detected into a trained named entity recognition model to determine a text entity, and perform compliance detection on the target text to be detected according to a comparison result between the text entity and the identification text.
It should be noted that, the functional modules of the text detection apparatus according to the embodiment of the present invention are the same as the steps of the exemplary embodiment of the text detection method, and therefore, the description thereof is omitted here.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
In an exemplary embodiment of the present invention, there is also provided an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 900 according to this embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one memory unit 920, and a bus 930 that couples various system components including the memory unit 920 and the processing unit 910.
Wherein the storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification. For example, the processing unit 910 may perform the steps as shown in fig. 1.
The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 9201 and/or a cache memory unit 9202, and may further include a read-only memory unit (ROM) 9203.
Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the text detection method according to the embodiment of the present invention.
In an exemplary embodiment of the present invention, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
According to the program product for realizing the method, the portable compact disc read only memory (CD-ROM) can be adopted, the program code is included, and the program product can be operated on terminal equipment, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

Claims (10)

1. A text detection method, comprising:
analyzing a text to be detected to obtain a target text to be detected, and marking an identification text existing in the target text to be detected;
classifying the target text to be detected through a trained classification model to determine a classification result;
if the classification result belongs to a preset type, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to the comparison result;
and if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
2. The text detection method according to claim 1, wherein before the target text to be detected is classified by the trained classification model, the method further comprises:
acquiring first sample data, and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence;
training the word vector sequence through a long short-term memory (LSTM) network to obtain context features;
obtaining a feature vector of the word vector sequence by applying a maximum pooling operation to the context features;
and inputting the feature vectors into a linear layer and a classification layer in sequence to obtain the trained classification model.
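The max-pooling step of claim 2 reduces the per-character context features to a single sequence-level feature vector by taking the maximum over time in each dimension. A minimal sketch, with hand-written toy values in place of real LSTM outputs:

```python
def max_pool(context_features):
    """Max over the time axis: one value per feature dimension."""
    return [max(step[d] for step in context_features)
            for d in range(len(context_features[0]))]

# three timesteps of 4-dimensional context features (toy values)
feats = [[0.1, 0.9, 0.2, 0.0],
         [0.4, 0.3, 0.8, 0.1],
         [0.2, 0.5, 0.1, 0.7]]
print(max_pool(feats))  # [0.4, 0.9, 0.8, 0.7]
```

The pooled vector is what the claim then feeds through the linear and classification layers.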
3. The text detection method of claim 2, wherein obtaining first sample data and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence comprises:
preprocessing a historical detection text, and determining the preprocessed historical detection text as a positive sample and a negative sample to obtain first sample data;
obtaining word vectors according to the preprocessed historical detection texts;
serializing the preprocessed historical detection texts to obtain serialized historical detection texts;
and constructing the embedding layer according to the word vectors and the serialized historical detection texts, and inputting the first sample data into the embedding layer to generate the word vector sequence.
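The embedding-layer construction in claim 3 amounts to a token-to-vector lookup built from pretrained word vectors. A toy sketch (the vocabulary, vector values, and `<pad>` token are assumptions for illustration):

```python
def build_embedding(vocab_vectors):
    """Return a lookup that maps a token sequence to its vector sequence."""
    def embed(tokens):
        # unknown tokens fall back to the padding vector (an assumed convention)
        return [vocab_vectors.get(t, vocab_vectors["<pad>"]) for t in tokens]
    return embed

vectors = {"text": [0.1, 0.2], "detect": [0.3, 0.4], "<pad>": [0.0, 0.0]}
embed = build_embedding(vectors)
print(embed(["text", "detect"]))  # [[0.1, 0.2], [0.3, 0.4]]
```

In the patent the vectors come from the preprocessed historical detection texts rather than a hand-written dictionary.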
4. The text detection method according to claim 1, wherein before the target text to be detected is sequence-labeled through the trained named entity recognition model to determine the text entity, the method further comprises:
acquiring second sample data, wherein the second sample data is obtained according to a sequence marking rule;
inputting the second sample data into a long short-term memory network to obtain, for each character in the second sample data, the probability of each sequence label;
and inputting the probabilities, together with the transition probabilities between labels, into a conditional random field layer to perform sentence-level sequence labeling, so as to obtain the trained named entity recognition model.
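At decoding time, combining per-character label scores with label transition scores, as in the LSTM-CRF of claim 4, is classically done with Viterbi decoding. A standard sketch with toy scores (the label set, score values, and the O→I penalty are assumptions, not values from the patent):

```python
def viterbi(emissions, transitions, labels):
    """Best label sequence given per-character label scores (from the LSTM)
    and label-to-label transition scores (the CRF part)."""
    best = {l: emissions[0][l] for l in labels}  # scores for the first character
    back = []
    for em in emissions[1:]:
        prev, ptr = best, {}
        best = {}
        for l in labels:
            # best previous label for l, under score + transition
            p, s = max(((pl, prev[pl] + transitions[(pl, l)]) for pl in labels),
                       key=lambda x: x[1])
            best[l] = s + em[l]
            ptr[l] = p
        back.append(ptr)
    # backtrack from the best final label
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

labels = ["B", "I", "O"]
trans = {(a, b): 0.0 for a in labels for b in labels}
trans[("O", "I")] = -10.0  # sentence-level constraint: I may not follow O
emissions = [{"B": 2.0, "I": 0.0, "O": 0.5},
             {"B": 0.0, "I": 1.5, "O": 0.4},
             {"B": 0.1, "I": 0.2, "O": 2.0}]
print(viterbi(emissions, trans, labels))  # ['B', 'I', 'O']
```

The transition table is what makes the labeling sentence-level: a character's label depends on its neighbors, not just its own LSTM score.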
5. The text detection method according to claim 1, wherein marking the identification text existing in the target text to be detected comprises:
if the identification text is all of the text in the target text to be detected, adding a first mark to the identification text, and determining the mark position of the identification text;
and if the identification text is a part of the text in the target text to be detected, adding a second mark to the identification text, and determining the mark position of the identification text.
6. The text detection method according to claim 5, wherein comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to a comparison result comprises:
acquiring an identification text in the target text to be detected according to the marking position;
if the target text in the target text to be detected is consistent with the identification text, determining that the target text is compliant;
and if the target text is not consistent with the identification text, determining that the target text is not compliant.
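Claims 5 and 6 can be illustrated with a toy marker and comparator. The mark names ("FULL"/"PART") and the substring semantics are assumptions for illustration; the patent only requires that full-text and partial-text matches carry distinct marks with recorded positions:

```python
def mark_identification(target_text, identification_text):
    """Return a (mark, start, end) triple: FULL when the identification
    text covers the whole target text, PART when it is a substring."""
    if identification_text == target_text:
        return ("FULL", 0, len(target_text))
    start = target_text.find(identification_text)
    if start == -1:
        return None  # identification text not present
    return ("PART", start, start + len(identification_text))

def is_compliant(target_text, identification_text):
    """Fetch the text at the marked position and compare it (claim 6)."""
    mark = mark_identification(target_text, identification_text)
    if mark is None:
        return False
    _, start, end = mark
    return target_text[start:end] == identification_text

print(mark_identification("Taikang Insurance", "Insurance"))  # ('PART', 8, 17)
print(is_compliant("Taikang Insurance", "Insurance"))         # True
```

In practice the mark positions would be recorded when the text is parsed (claim 1), so the comparison step never re-searches the document.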
7. The text detection method according to claim 1, wherein performing compliance detection on the target text to be detected according to the comparison result between the text entity and the identification text comprises:
acquiring a text entity according to the marked position of the text entity;
if the text entity is consistent with the identification text, determining that the text entity is in compliance;
and if the text entity is not consistent with the identification text, determining that the text entity is not in compliance.
8. A text detection apparatus, comprising:
the identification text determining module is used for analyzing the text to be detected to obtain a target text to be detected and marking the identification text existing in the target text to be detected;
the text classification module is used for classifying the target text to be detected through the trained classification model so as to determine a classification result;
the first detection module is used for comparing the target text to be detected with the identification text if the classification result belongs to a preset type, and performing compliance detection on the target text to be detected according to the comparison result;
and the second detection module is used for inputting the target text to be detected into a trained named entity recognition model to determine a text entity if the classification result does not belong to the preset type, and performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the text detection method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the text detection method of any one of claims 1-7 via execution of the executable instructions.
CN201911088731.5A 2019-11-08 2019-11-08 Text detection method and device, electronic equipment and storage medium Active CN111079432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088731.5A CN111079432B (en) 2019-11-08 2019-11-08 Text detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088731.5A CN111079432B (en) 2019-11-08 2019-11-08 Text detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111079432A true CN111079432A (en) 2020-04-28
CN111079432B CN111079432B (en) 2023-07-18

Family

ID=70310743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088731.5A Active CN111079432B (en) 2019-11-08 2019-11-08 Text detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111079432B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930939A (en) * 2020-07-08 2020-11-13 泰康保险集团股份有限公司 Text detection method and device
CN112464660A (en) * 2020-11-25 2021-03-09 平安医疗健康管理股份有限公司 Text classification model construction method and text data processing method
CN112597299A (en) * 2020-12-07 2021-04-02 深圳价值在线信息科技股份有限公司 Text entity classification method and device, terminal equipment and storage medium
CN112784594A (en) * 2020-06-05 2021-05-11 珠海金山办公软件有限公司 Document processing method and device, electronic equipment and readable storage medium
CN113704467A (en) * 2021-07-29 2021-11-26 大箴(杭州)科技有限公司 Massive text monitoring method and device based on data template, medium and equipment
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060277173A1 (en) * 2005-06-07 2006-12-07 Microsoft Corporation Extraction of information from documents
CN107680579A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Text regularization model training method and device, text regularization method and device
US20180336181A1 (en) * 2017-05-17 2018-11-22 International Business Machines Corporation Natural language processing of formatted documents
CN110134961A (en) * 2019-05-17 2019-08-16 北京邮电大学 Processing method, device and the storage medium of text


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784594A (en) * 2020-06-05 2021-05-11 珠海金山办公软件有限公司 Document processing method and device, electronic equipment and readable storage medium
CN112784594B (en) * 2020-06-05 2023-05-26 珠海金山办公软件有限公司 Document processing method and device, electronic equipment and readable storage medium
CN111930939A (en) * 2020-07-08 2020-11-13 泰康保险集团股份有限公司 Text detection method and device
CN112464660A (en) * 2020-11-25 2021-03-09 平安医疗健康管理股份有限公司 Text classification model construction method and text data processing method
CN112597299A (en) * 2020-12-07 2021-04-02 深圳价值在线信息科技股份有限公司 Text entity classification method and device, terminal equipment and storage medium
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short
CN113704467A (en) * 2021-07-29 2021-11-26 大箴(杭州)科技有限公司 Massive text monitoring method and device based on data template, medium and equipment

Also Published As

Publication number Publication date
CN111079432B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN107908635B (en) Method and device for establishing text classification model and text classification
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
US10372821B2 (en) Identification of reading order text segments with a probabilistic language model
CN111079432B (en) Text detection method and device, electronic equipment and storage medium
JP5901001B1 (en) Method and device for acoustic language model training
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
US10713519B2 (en) Automated workflows for identification of reading order from text segments using probabilistic language models
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN111783450B (en) Phrase extraction method and device in corpus text, storage medium and electronic equipment
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
Singh et al. HINDIA: a deep-learning-based model for spell-checking of Hindi language
CN112614559A (en) Medical record text processing method and device, computer equipment and storage medium
CN114298035A (en) Text recognition desensitization method and system thereof
CN116070632A (en) Informal text entity tag identification method and device
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN110705211A (en) Text key content marking method and device, computer equipment and storage medium
CN113761895A (en) Text abstract generation method and device, electronic equipment and storage medium
CN113705207A (en) Grammar error recognition method and device
CN111666405B (en) Method and device for identifying text implication relationship
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
US20120197894A1 (en) Apparatus and method for processing documents to extract expressions and descriptions
US20230042683A1 (en) Identifying and transforming text difficult to understand by user
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant