CN114372463A - Multi-language text error correction method based on sequence labeling model - Google Patents

Multi-language text error correction method based on sequence labeling model Download PDF

Info

Publication number
CN114372463A
Authority
CN
China
Prior art keywords
word
tag
label
correction
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210023205.6A
Other languages
Chinese (zh)
Inventor
李梅
潘丽婷
陈件
张井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yizhe Information Technology Co., Ltd.
Original Assignee
Shanghai Yizhe Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yizhe Information Technology Co., Ltd.
Priority to CN202210023205.6A
Publication of CN114372463A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A multilingual text error correction method based on a sequence labeling model comprises the following steps: step S1: corpus data collection; step S2: data generation, taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples; step S3: data labeling, generating a text error detection label and a corrective action label for each word or character of a negative example sentence and classifying and labeling the two label types respectively; step S4: training the text error correction sequence labeling model; step S5: generating text correction alternatives. The invention overcomes the defects of the prior art and solves the problems of high cost and difficult maintenance caused by building a separate text error correction model for each language.

Description

Multi-language text error correction method based on sequence labeling model
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-language text error correction method based on a sequence labeling model.
Background
With the increasing number of non-native language learners, writers, and users worldwide, the need for automated language assessment, such as text error correction, keeps growing. Automatic text error correction must correct grammar, spelling, content, and other errors in the original text while preserving its semantics.
Most existing text error correction methods target a single language, and a multilingual scenario would require integrating text error correction models for more than 100 languages. A universal method for multilingual text error correction is therefore needed to avoid the high cost and maintenance burden of building a separate text error correction model for each language.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multilingual text error correction method based on a sequence labeling model, which solves the problems of high cost and difficult maintenance caused by building a separate text error correction model for each language.
In order to achieve the above purpose, the invention is realized by the following technical solution:
a multilingual text error correction method based on a sequence labeling model comprises the following steps:
step S1: corpus data collection; selecting multiple languages from the Wikidata multilingual knowledge base, obtaining the multilingual Wikipedia corpus, stripping the HTML markup, extracting plain text, and segmenting the corpus into sentences and words;
step S2: data generation; taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples;
step S3: data labeling; generating a text error detection label and a corrective action label for each word or character of the negative example sentence according to the positive example sentence; classifying and labeling the text error detection labels and the corrective action labels respectively;
step S4: training the sequence labeling model on the multilingual text corpus, and outputting, for each word or character, the text error detection label type and the corrective action label type with the maximum probability;
step S5: after the corrective action labels are predicted, performing the corresponding insertion, deletion, or modification operation on each word or character according to its label, and generating alternatives.
Preferably, the generation of negative examples in step S2 includes the following methods:
step S21: randomly deleting a word or character in the positive example;
step S22: randomly inserting a word or character of the same language into the positive example;
step S23: randomly replacing a word or character in the positive example with a word or character of the same language, wherein the edit distance between the replaced item and the replacement is not more than 2.
Preferably, the classification of the text error detection labels in step S3 includes a CORRECT tag and an INCORRECT tag, where the CORRECT tag indicates that the word or character is correct, and the INCORRECT tag indicates that the word or character is incorrect;
the classification of corrective action labels includes a KEEP tag, an INSERT tag, a DELETE tag, and a REPLACE tag; the KEEP tag indicates that the word or character remains unchanged, the INSERT tag indicates that a word or character is inserted before it, the DELETE tag indicates that the word or character is deleted, and the REPLACE tag indicates that the word or character is replaced.
Preferably, the step S4 specifically includes the following steps:
step S41: constructing a dictionary of multilingual words or characters, and segmenting the corpus with the WordPiece algorithm;
step S42: embedding the words or characters, mapping each word or character to a 768-dimensional word vector and its position to a 768-dimensional position vector;
step S43: adding the 768-dimensional word vector and position vector, feeding the sum into a multi-head attention layer, and outputting a 768-dimensional hidden vector;
step S44: attaching two fully connected layers: one predicts the error detection label, taking the 768-dimensional hidden vector as input and outputting a 2-dimensional vector; the other predicts the corrective action label, taking the 768-dimensional hidden vector as input and outputting a 4-dimensional vector;
step S45: attaching a softmax layer after the fully connected layers for normalization, and computing the probabilities of the error detection labels and of the corrective action labels respectively;
step S46: according to the two probability distributions, outputting for each word or character the two label types with the highest probability.
Preferably, the step S4 further includes:
step S47: designing a loss function:
the text error correction model comprises 2 sequence labeling tasks, so that the loss function of the model is divided into an error detection loss function and a corrective action loss function as follows:
loss = loss_{detection} + loss_{correction}
loss_{detection} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_d} w_j^{d}\, y_{ij}^{d}\, \log p_{ij}^{d}
loss_{correction} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_c} w_j^{c}\, y_{ij}^{c}\, \log p_{ij}^{c}
M_d = 2
M_c = 4
wherein loss is the total loss function, loss_{detection} is the error detection loss function, loss_{correction} is the corrective action loss function, N is the number of words or characters of a single sentence (at most 60), M_d is the number of error detection labels, M_c is the number of corrective action labels, w_j^{d} is the weight of the class-j error detection label (w_1^{d} is the CORRECT label weight and w_2^{d} is the INCORRECT label weight), w_j^{c} is the weight of the class-j corrective action label (w_1^{c}, w_2^{c}, w_3^{c}, and w_4^{c} are the weights of the KEEP, INSERT, REPLACE, and DELETE labels, respectively), y_{ij}^{d} and y_{ij}^{c} are 1 when label j is the true label of the i-th word or character and 0 otherwise, and p_{ij}^{d} and p_{ij}^{c} are the predicted probabilities of label j for the i-th word or character.
Preferably, in step S5, after the corrective action tags are predicted, words or characters with the KEEP tag are left unchanged, and the remaining words or characters are modified according to their tags to generate alternatives; specifically:
(1) when the corrective action tag of a word or character is predicted to be DELETE, the word or character is deleted directly;
(2) when the corrective action tag of a word or character is predicted to be REPLACE, the word or character is masked, and a masked language model generates its alternatives;
(3) when the corrective action tag of a word or character is predicted to be INSERT, a mask token is added before the word or character, and a masked language model generates the alternatives.
The invention provides a multilingual text error correction method based on a sequence labeling model with the following beneficial effect: as a universal method for multilingual text error correction, it solves the problems of high cost and difficult maintenance caused by building a separate text error correction model for each language.
Drawings
In order to illustrate the present invention and the prior art solutions more clearly, the drawings needed in the description are briefly introduced below.
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a schematic diagram of positive and negative examples and label construction in the present invention;
FIG. 3 is a model diagram of text error correction sequence labeling according to the present invention;
FIG. 4 is an operation diagram of the present invention when the corrective action tag is predicted to be REPLACE;
FIG. 5 is an operation diagram of the present invention when the corrective action tag is predicted to be INSERT.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in FIGS. 1-5, a multilingual text error correction method based on a sequence labeling model includes the following steps:
step S1: corpus data collection;
selecting the top 100 languages in the Wikidata multilingual knowledge base, downloading the Wikipedia corpora of these languages, stripping the HTML markup, extracting plain text, and segmenting the corpus into sentences and words;
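By way of illustration, the cleaning and sentence splitting described above can be sketched in Python as follows; the regular-expression cleanup, the file name, and the punctuation-based splitting are simplifying assumptions rather than details given in the patent:

```python
import re

def clean_wikipedia_page(raw_html: str) -> str:
    """Strip HTML markup and keep plain text only."""
    text = re.sub(r"<[^>]+>", " ", raw_html)          # drop HTML tags
    text = re.sub(r"&[a-zA-Z#0-9]+;", " ", text)      # drop HTML entities such as &nbsp;
    return re.sub(r"\s+", " ", text).strip()          # normalize whitespace

def split_sentences(text: str) -> list[str]:
    """Rough splitting on sentence-final punctuation; production-grade multilingual
    splitting would need language-specific rules."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]

# usage: one cleaned page -> a list of plain-text sentences
with open("wiki_page.html", encoding="utf-8") as f:   # hypothetical local dump page
    sentences = split_sentences(clean_wikipedia_page(f.read()))
```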
step S2: generating data;
taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples;
the generation of negative examples includes the following ways:
(21): randomly deleting a word or character in the positive example;
(22): randomly inserting a word or character of the same language into the positive example;
(23): randomly replacing a word or character in the positive example with a word or character of the same language, the edit distance between the replaced item and the replacement being no more than 2; for example, the English "apples" is replaced with "applet" and the German "Steigen" with "Steigern", so that the generated negative examples are closer to real error patterns.
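A minimal sketch of this negative-example generation, assuming a tokenized positive sentence and a same-language vocabulary are already available; the uniform choice among the three operations and the brute-force edit-distance filter are illustrative simplifications:

```python
import random

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def corrupt_sentence(tokens: list[str], vocab: list[str], rate: float = 0.15) -> list[str]:
    """Randomly delete, insert, or replace about 15% of the tokens."""
    out = []
    for tok in tokens:
        if random.random() >= rate:
            out.append(tok)
            continue
        op = random.choice(["delete", "insert", "replace"])
        if op == "delete":
            continue                                     # (21) drop the token
        if op == "insert":
            out.extend([random.choice(vocab), tok])      # (22) random same-language token before it
        else:                                            # (23) near neighbour, edit distance <= 2
            near = [w for w in vocab if w != tok and edit_distance(w, tok) <= 2]
            out.append(random.choice(near) if near else random.choice(vocab))
    return out
```

Scanning the whole vocabulary for every replacement is slow but keeps the sketch short; a real implementation would precompute neighbour lists.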
Step S3: labeling data; generating a text error detection label and a corrective action label for each word or character of the negative example sentence according to the positive example sentence; classifying and labeling the text error detection labels and the corrective action labels respectively;
as shown in fig. 2, the classification of the text error detection labels includes a CORRECT tag and an INCORRECT tag; the CORRECT tag indicates that the word or character is correct, and the INCORRECT tag indicates that the word or character is incorrect;
the classification of corrective action labels includes a KEEP tag, an INSERT tag, a DELETE tag, and a REPLACE tag; the KEEP tag indicates that the word or character remains unchanged, the INSERT tag indicates that a word or character is inserted before it, the DELETE tag indicates that the word or character is deleted, and the REPLACE tag indicates that the word or character is replaced.
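The two label sequences can be derived by aligning the negative sentence against its positive counterpart. The sketch below uses Python's difflib for that alignment and tags the token that follows a missing word with INSERT; this is one plausible reading of the scheme rather than the patent's exact procedure:

```python
import difflib

def make_labels(negative: list[str], positive: list[str]) -> list[tuple[str, str, str]]:
    """Return (token, detection_label, correction_label) for each token of the negative example."""
    detect = ["CORRECT"] * len(negative)
    correct = ["KEEP"] * len(negative)
    matcher = difflib.SequenceMatcher(a=negative, b=positive, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":                       # token differs from the reference
            for i in range(i1, i2):
                detect[i], correct[i] = "INCORRECT", "REPLACE"
        elif op == "delete":                      # extra token in the negative example
            for i in range(i1, i2):
                detect[i], correct[i] = "INCORRECT", "DELETE"
        elif op == "insert" and i1 < len(negative):
            # a token is missing before position i1 -> the following token carries INSERT
            detect[i1], correct[i1] = "INCORRECT", "INSERT"
    return list(zip(negative, detect, correct))

# usage: "applet" is labeled INCORRECT / REPLACE against the reference "apples"
print(make_labels(["she", "likes", "applet"], ["she", "likes", "apples"]))
```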
Step S4: training the sequence labeling model on the multilingual text corpus, and outputting, for each word or character, the text error detection label type and the corrective action label type with the maximum probability; the method specifically comprises the following steps:
step S41: word segmentation of the multilingual text;
constructing a dictionary of multilingual words or characters, and segmenting the corpus with the WordPiece algorithm;
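One possible way to build such a shared multilingual WordPiece vocabulary is the Hugging Face tokenizers library; the library choice, vocabulary size, and special tokens below are assumptions, not values given in the patent:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# train a shared multilingual WordPiece vocabulary on the cleaned corpus
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=120000,                                      # illustrative size only
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["multilingual_corpus.txt"], trainer=trainer)  # one sentence per line

ids = tokenizer.encode("She likes applet .").ids            # word/character pieces -> vocabulary ids
```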
step S42: a word embedding layer;
embedding words or characters, and mapping each word or character into a 768-dimensional word vector; mapping the positions (0-59) of the words or characters to a 768-dimensional position vector;
step S43: a multi-headed attention layer;
adding the 768-dimensional word vector and the position vector, inputting the sum into a multi-head attention layer (the number of heads is 12, the number of layers is 6), and outputting a 768-dimensional hidden vector;
step S44: fully connected layers;
two fully connected layers are attached: one predicts the error detection label, taking the 768-dimensional hidden vector as input and outputting a 2-dimensional vector; the other predicts the corrective action label, taking the 768-dimensional hidden vector as input and outputting a 4-dimensional vector;
step S45: normalization;
a softmax layer is attached after the fully connected layers for normalization, and the probabilities of the error detection labels and of the corrective action labels are calculated respectively;
step S46: label output;
according to the two probability distributions, the two label types with the highest probability are output for each word or character, as shown in fig. 3.
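Steps S42 to S46 can be sketched with PyTorch as follows, using torch.nn.TransformerEncoder as a stand-in for the multi-head attention stack; the hidden size of 768, 12 heads, 6 layers, and maximum length of 60 follow the description, while the feed-forward size and the label index order are assumptions:

```python
import torch
import torch.nn as nn

class CorrectionTagger(nn.Module):
    """Shared encoder with two tag heads, as in steps S42-S46."""

    def __init__(self, vocab_size: int, max_len: int = 60, hidden: int = 768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)           # S42: 768-dim word/character vectors
        self.pos_emb = nn.Embedding(max_len, hidden)               # S42: 768-dim position vectors (0-59)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=12, dim_feedforward=4 * hidden,  # feed-forward size assumed
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)  # S43: 12 heads, 6 layers
        self.detect_head = nn.Linear(hidden, 2)                    # S44: error detection head
        self.correct_head = nn.Linear(hidden, 4)                   # S44: corrective action head

    def forward(self, token_ids: torch.Tensor):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.encoder(self.word_emb(token_ids) + self.pos_emb(positions))  # S43
        detect_prob = torch.softmax(self.detect_head(hidden), dim=-1)              # S45: softmax
        correct_prob = torch.softmax(self.correct_head(hidden), dim=-1)
        return detect_prob, correct_prob

# S46: per-token output of the most probable label of each kind
model = CorrectionTagger(vocab_size=120000)
detect_prob, correct_prob = model(torch.randint(0, 120000, (1, 60)))
detect_label = detect_prob.argmax(dim=-1)    # 0 = CORRECT, 1 = INCORRECT (index convention assumed)
correct_label = correct_prob.argmax(dim=-1)  # 0 = KEEP, 1 = INSERT, 2 = DELETE, 3 = REPLACE (assumed)
```

Except for the number of layers, these dimensions match the BERT-base configuration, so in practice the encoder could also be initialized from a pretrained multilingual model.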
Step S47: designing a loss function:
the text error correction model comprises 2 sequence labeling tasks, so that the loss function of the model is divided into an error detection loss function and a corrective action loss function as follows:
loss = loss_{detection} + loss_{correction}
loss_{detection} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_d} w_j^{d}\, y_{ij}^{d}\, \log p_{ij}^{d}
loss_{correction} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_c} w_j^{c}\, y_{ij}^{c}\, \log p_{ij}^{c}
M_d = 2
M_c = 4
where loss is the total loss function, loss_{detection} is the error detection loss function, loss_{correction} is the corrective action loss function, N is the number of words or characters of a single sentence (at most 60), M_d is the number of error detection labels, M_c is the number of corrective action labels, w_j^{d} is the weight of the class-j error detection label (w_1^{d} is the CORRECT label weight and w_2^{d} is the INCORRECT label weight), w_j^{c} is the weight of the class-j corrective action label (w_1^{c}, w_2^{c}, w_3^{c}, and w_4^{c} are the weights of the KEEP, INSERT, REPLACE, and DELETE labels, respectively), y_{ij}^{d} and y_{ij}^{c} are 1 when label j is the true label of the i-th word or character and 0 otherwise, and p_{ij}^{d} and p_{ij}^{c} are the predicted probabilities of label j for the i-th word or character.
Because only 15% of the words are perturbed during data generation, the error detection labels and corrective action labels are imbalanced: most words or characters carry the CORRECT or KEEP label, which severely affects the parameter estimation of the model. The loss-function weights of the INCORRECT label and of the INSERT, REPLACE, and DELETE labels are therefore increased to 6 and 24 respectively, reducing the influence of sample imbalance.
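A sketch of the step S47 loss with these class weights, computed directly from the per-token probabilities so that it matches the formulas above; the baseline weight of 1 for the CORRECT and KEEP labels is an assumption:

```python
import torch

# class weights: [CORRECT, INCORRECT] and [KEEP, INSERT, DELETE, REPLACE];
# 6 and 24 come from the description, the baseline weight 1 is assumed
W_DETECT = torch.tensor([1.0, 6.0])
W_CORRECT = torch.tensor([1.0, 24.0, 24.0, 24.0])

def weighted_nll(prob: torch.Tensor, target: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """-(1/N) * sum_i sum_j w_j * y_ij * log(p_ij) for one label type."""
    p_true = prob.gather(-1, target.unsqueeze(-1)).squeeze(-1).clamp_min(1e-9)
    return -(weight[target] * p_true.log()).mean()

def total_loss(detect_prob, correct_prob, detect_target, correct_target):
    """loss = loss_detection + loss_correction."""
    return (weighted_nll(detect_prob, detect_target, W_DETECT)
            + weighted_nll(correct_prob, correct_target, W_CORRECT))
```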
In training the sequence labeling model on the multilingual text corpus, the ratio of training set to test set is 97:3, with a training set size of 97,000,000 and a test set size of 3,000,000; the number of epochs is 10 and the learning rate is 5e-5. The evaluation indexes on the test set are as follows:
TABLE 1 Evaluation indexes
Label type               F1     Precision  Recall
Error detection label    0.753  0.717      0.791
Corrective action label  0.757  0.718      0.802
Step S5: after the corrective action labels are predicted, words or characters with the KEEP tag are left unchanged, and the remaining words or characters are subjected to the corresponding insertion, deletion, or replacement operation according to their labels to generate alternatives; specifically:
(1) when the corrective action tag of a word or character is predicted to be DELETE, the word or character is deleted directly;
(2) when the corrective action tag of a word or character is predicted to be REPLACE, the word or character is masked, and a masked language model generates its alternatives (as shown in FIG. 4);
(3) when the corrective action tag of a word or character is predicted to be INSERT, a mask token is added before the word or character, and a masked language model generates the alternatives (as shown in FIG. 5).
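For the REPLACE and INSERT cases, the masked language model step can be sketched with the Hugging Face fill-mask pipeline; the multilingual checkpoint named below is only an example, as the patent does not specify a particular model:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")  # assumed checkpoint
MASK = fill_mask.tokenizer.mask_token                                    # "[MASK]" for this model

def alternatives(tokens: list[str], position: int, action: str, top_k: int = 5) -> list[str]:
    """Generate candidate corrections for the token at `position` given its corrective action tag."""
    if action == "DELETE":                                   # (1) simply drop the token
        return [" ".join(tokens[:position] + tokens[position + 1:])]
    if action == "REPLACE":                                  # (2) mask the token itself
        masked = tokens[:position] + [MASK] + tokens[position + 1:]
    elif action == "INSERT":                                 # (3) add a mask before the token
        masked = tokens[:position] + [MASK] + tokens[position:]
    else:                                                    # KEEP: nothing to change
        return [" ".join(tokens)]
    return [pred["sequence"] for pred in fill_mask(" ".join(masked), top_k=top_k)]

# usage: REPLACE predicted on "applet" in "she likes applet"
print(alternatives(["she", "likes", "applet"], 2, "REPLACE"))
```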
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A multilingual text error correction method based on a sequence labeling model is characterized in that: the method comprises the following steps:
step S1: corpus data collection; selecting multiple languages from the Wikidata multilingual knowledge base, obtaining the multilingual Wikipedia corpus, stripping the HTML markup, extracting plain text, and segmenting the corpus into sentences and words;
step S2: data generation; taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples;
step S3: data labeling; generating a text error detection label and a corrective action label for each word or character of the negative example sentence according to the positive example sentence; classifying and labeling the text error detection labels and the corrective action labels respectively;
step S4: training the text error correction sequence labeling model; training the sequence labeling model on the multilingual text corpus, and outputting, for each word or character, the text error detection label type and the corrective action label type with the maximum probability;
step S5: generating text error correction alternatives; after the corrective action labels are predicted, performing the corresponding insertion, deletion, or modification operation on each word or character according to its label, and generating alternatives.
2. The method according to claim 1, wherein the generation of negative examples in step S2 includes the following methods:
step S21: randomly deleting a word or character in the positive example;
step S22: randomly inserting a word or character of the same language into the positive example;
step S23: randomly replacing a word or character in the positive example with a word or character of the same language, wherein the edit distance between the replaced item and the replacement is not more than 2.
3. The method according to claim 1, wherein the classification of the text error detection labels in step S3 includes a CORRECT tag and an INCORRECT tag, where the CORRECT tag indicates that a word or character is correct, and the INCORRECT tag indicates that a word or character is incorrect;
the classification of corrective action labels includes a KEEP tag, an INSERT tag, a DELETE tag, and a REPLACE tag; the KEEP tag indicates that the word or character remains unchanged, the INSERT tag indicates that a word or character is inserted before it, the DELETE tag indicates that the word or character is deleted, and the REPLACE tag indicates that the word or character is replaced.
4. The method according to claim 3, wherein step S4 specifically includes the following steps:
step S41: constructing a dictionary of multilingual words or characters, and segmenting the corpus with the WordPiece algorithm;
step S42: embedding the words or characters, mapping each word or character to a 768-dimensional word vector and its position to a 768-dimensional position vector;
step S43: adding the 768-dimensional word vector and position vector, feeding the sum into a multi-head attention layer, and outputting a 768-dimensional hidden vector;
step S44: attaching two fully connected layers: one predicts the error detection label, taking the 768-dimensional hidden vector as input and outputting a 2-dimensional vector; the other predicts the corrective action label, taking the 768-dimensional hidden vector as input and outputting a 4-dimensional vector;
step S45: attaching a softmax layer after the fully connected layers for normalization, and computing the probabilities of the error detection labels and of the corrective action labels respectively;
step S46: according to the two probability distributions, outputting for each word or character the two label types with the highest probability.
5. The method according to claim 4, wherein step S4 further includes:
step S47: designing a loss function:
the text error correction model comprises 2 sequence labeling tasks, so that the loss function of the model is divided into an error detection loss function and a corrective action loss function as follows:
loss = loss_{detection} + loss_{correction}
loss_{detection} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_d} w_j^{d}\, y_{ij}^{d}\, \log p_{ij}^{d}
loss_{correction} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_c} w_j^{c}\, y_{ij}^{c}\, \log p_{ij}^{c}
M_d = 2
M_c = 4
wherein loss is the total loss function, loss_{detection} is the error detection loss function, loss_{correction} is the corrective action loss function, N is the number of words or characters of a single sentence (at most 60), M_d is the number of error detection labels, M_c is the number of corrective action labels, w_j^{d} is the weight of the class-j error detection label (w_1^{d} is the CORRECT label weight and w_2^{d} is the INCORRECT label weight), w_j^{c} is the weight of the class-j corrective action label (w_1^{c}, w_2^{c}, w_3^{c}, and w_4^{c} are the weights of the KEEP, INSERT, REPLACE, and DELETE labels, respectively), y_{ij}^{d} and y_{ij}^{c} are 1 when label j is the true label of the i-th word or character and 0 otherwise, and p_{ij}^{d} and p_{ij}^{c} are the predicted probabilities of label j for the i-th word or character.
6. The method according to claim 5, wherein in step S5, after the corrective action tags are predicted, words or characters with the KEEP tag are left unchanged, and the remaining words or characters are subjected to the corresponding insertion, deletion, or modification operation according to their tags to generate alternatives; specifically:
(1) when the corrective action tag of a word or character is predicted to be DELETE, the word or character is deleted directly;
(2) when the corrective action tag of a word or character is predicted to be REPLACE, the word or character is masked, and a masked language model generates its alternatives;
(3) when the corrective action tag of a word or character is predicted to be INSERT, a mask token is added before the word or character, and a masked language model generates the alternatives.
CN202210023205.6A 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model Pending CN114372463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210023205.6A CN114372463A (en) 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210023205.6A CN114372463A (en) 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model

Publications (1)

Publication Number Publication Date
CN114372463A true CN114372463A (en) 2022-04-19

Family

ID=81144094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210023205.6A Pending CN114372463A (en) 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model

Country Status (1)

Country Link
CN (1) CN114372463A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925170A (en) * 2022-05-25 2022-08-19 人民网股份有限公司 Text proofreading model training method and device and computing equipment
CN114925170B (en) * 2022-05-25 2023-04-07 人民网股份有限公司 Text proofreading model training method and device and computing equipment

Similar Documents

Publication Publication Date Title
CN110110327B (en) Text labeling method and equipment based on counterstudy
JP5356197B2 (en) Word semantic relation extraction device
CN108446271B (en) Text emotion analysis method of convolutional neural network based on Chinese character component characteristics
CN110134949B (en) Text labeling method and equipment based on teacher supervision
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
KR101813683B1 (en) Method for automatic correction of errors in annotated corpus using kernel Ripple-Down Rules
CN108628828A (en) A kind of joint abstracting method of viewpoint and its holder based on from attention
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
US11232358B1 (en) Task specific processing of regulatory content
CN105988990A (en) Device and method for resolving zero anaphora in Chinese language, as well as training method
Singh et al. A decision tree based word sense disambiguation system in Manipuri language
CN110110334B (en) Remote consultation record text error correction method based on natural language processing
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
CN110147546B (en) Grammar correction method and device for spoken English
CN112287100A (en) Text recognition method, spelling error correction method and voice recognition method
CN109213998A (en) Chinese wrongly written character detection method and system
CN116306600B (en) MacBert-based Chinese text error correction method
CN113657098A (en) Text error correction method, device, equipment and storage medium
CN113448843A (en) Defect analysis-based image recognition software test data enhancement method and device
Boroş et al. A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
CN113312918B (en) Word segmentation and capsule network law named entity identification method fusing radical vectors
CN114372463A (en) Multi-language text error correction method based on sequence labeling model
CN110674642A (en) Semantic relation extraction method for noisy sparse text
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN115688703B (en) Text error correction method, storage medium and device in specific field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination