CN114372463A - Multi-language text error correction method based on sequence labeling model - Google Patents

Multi-language text error correction method based on sequence labeling model Download PDF

Info

Publication number
CN114372463A
Authority
CN
China
Prior art keywords
word
tag
label
correction
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210023205.6A
Other languages
Chinese (zh)
Inventor
李梅
潘丽婷
陈件
张井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yizhe Information Technology Co., Ltd.
Original Assignee
Shanghai Yizhe Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yizhe Information Technology Co., Ltd.
Priority to CN202210023205.6A
Publication of CN114372463A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A multilingual text error correction method based on a sequence labeling model comprises the following steps: step S1: corpus data collection; step S2: data generation, taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples; step S3: data labeling, generating a text error detection label and a corrective action label for each word or character of a negative example sentence and classifying and labeling the two label types respectively; step S4: training the text error correction sequence labeling model; step S5: generating text correction alternatives. The invention overcomes the defects of the prior art and solves the problems of high cost and difficult maintenance caused by building a separate text error correction model for each language.

Description

Multi-language text error correction method based on sequence labeling model
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-language text error correction method based on a sequence labeling model.
Background
With the increasing number of non-native language learners, writers, and users worldwide, the need for automated language assessment, such as text error correction, keeps growing. Automatic text error correction must correct grammar, spelling, content, and other errors in the original text while preserving its semantics.
Most existing text error correction methods target a single language, and a multilingual scenario would require integrating text error correction models for more than 100 languages. A universal method for multilingual text error correction is therefore needed to avoid the high cost and maintenance burden of building a separate text error correction model for each language.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multilingual text error correction method based on a sequence labeling model, which solves the problems of high cost and difficult maintenance caused by building a separate text error correction model for each language.
In order to achieve the above purpose, the invention is realized by the following technical solution:
a multilingual text error correction method based on a sequence labeling model comprises the following steps:
step S1: corpus data collection; selecting multiple languages from the Wikidata multilingual knowledge base, obtaining the multilingual Wikipedia corpus, stripping the HTML markup, extracting plain text, and segmenting the corpus into sentences and words;
step S2: data generation; taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples;
step S3: data labeling; generating a text error detection label and a corrective action label for each word or character of the negative example sentence according to the positive example sentence; classifying and labeling the text error detection labels and the corrective action labels respectively;
step S4: training the sequence labeling model on the multilingual text corpus, and outputting, for each word or character, the text error detection label type and the corrective action label type with the maximum probability;
step S5: after the corrective action labels are predicted, performing the corresponding insertion, deletion, or modification operation on each word or character according to its label, and generating alternatives.
Preferably, the generation of negative examples in step S2 includes the following methods:
step S21: randomly deleting a word or character in the positive example;
step S22: randomly inserting a word or character of the same language into the positive example;
step S23: randomly replacing a word or character in the positive example with a word or character of the same language, wherein the edit distance between the replaced item and the replacement is not more than 2.
Preferably, the classification of the text error detection labels in step S3 includes a CORRECT tag and an INCORRECT tag, where the CORRECT tag indicates that the word or character is correct, and the INCORRECT tag indicates that the word or character is incorrect;
the classification of corrective action labels includes a KEEP tag, an INSERT tag, a DELETE tag, and a REPLACE tag; the KEEP tag indicates that the word or character remains unchanged, the INSERT tag indicates that a word or character is inserted before it, the DELETE tag indicates that the word or character is deleted, and the REPLACE tag indicates that the word or character is replaced.
Preferably, the step S4 specifically includes the following steps:
step S41: constructing a dictionary of multilingual words or characters, and segmenting the corpus with the WordPiece algorithm;
step S42: embedding the words or characters, mapping each word or character to a 768-dimensional word vector and its position to a 768-dimensional position vector;
step S43: adding the 768-dimensional word vector and position vector, feeding the sum into a multi-head attention layer, and outputting a 768-dimensional hidden vector;
step S44: attaching two fully connected layers: one predicts the error detection label, taking the 768-dimensional hidden vector as input and outputting a 2-dimensional vector; the other predicts the corrective action label, taking the 768-dimensional hidden vector as input and outputting a 4-dimensional vector;
step S45: attaching a softmax layer after the fully connected layers for normalization, and computing the probabilities of the error detection labels and of the corrective action labels respectively;
step S46: according to the two probability distributions, outputting for each word or character the two label types with the highest probability.
Preferably, the step S4 further includes:
step S47: designing a loss function:
the text error correction model comprises 2 sequence labeling tasks, so that the loss function of the model is divided into an error detection loss function and a corrective action loss function as follows:
loss = loss_{detection} + loss_{correction}
loss_{detection} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_d} w_j^{d}\, y_{ij}^{d}\, \log p_{ij}^{d}
loss_{correction} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_c} w_j^{c}\, y_{ij}^{c}\, \log p_{ij}^{c}
M_d = 2
M_c = 4
wherein loss is the total loss function, loss_{detection} is the error detection loss function, loss_{correction} is the corrective action loss function, N is the number of words or characters of a single sentence (at most 60), M_d is the number of error detection labels, M_c is the number of corrective action labels, w_j^{d} is the weight of the class-j error detection label (w_1^{d} is the CORRECT label weight and w_2^{d} is the INCORRECT label weight), w_j^{c} is the weight of the class-j corrective action label (w_1^{c}, w_2^{c}, w_3^{c}, and w_4^{c} are the weights of the KEEP, INSERT, REPLACE, and DELETE labels, respectively), y_{ij}^{d} and y_{ij}^{c} are 1 when label j is the true label of the i-th word or character and 0 otherwise, and p_{ij}^{d} and p_{ij}^{c} are the predicted probabilities of label j for the i-th word or character.
Preferably, in step S5, after the corrective action tags are predicted, words or characters with the KEEP tag are left unchanged, and the remaining words or characters are modified according to their tags to generate alternatives; specifically:
(1) when the corrective action tag of a word or character is predicted to be DELETE, the word or character is deleted directly;
(2) when the corrective action tag of a word or character is predicted to be REPLACE, the word or character is masked, and a masked language model generates its alternatives;
(3) when the corrective action tag of a word or character is predicted to be INSERT, a mask token is added before the word or character, and a masked language model generates the alternatives.
The invention provides a multilingual text error correction method based on a sequence labeling model with the following beneficial effect: as a universal method for multilingual text error correction, it solves the problems of high cost and difficult maintenance caused by building a separate text error correction model for each language.
Drawings
In order to illustrate the present invention and the prior art solutions more clearly, the drawings needed in the description are briefly introduced below.
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a schematic diagram of positive and negative examples and label construction in the present invention;
FIG. 3 is a model diagram of text error correction sequence labeling according to the present invention;
FIG. 4 is an operation diagram of the present invention when the corrective action tag is predicted to be REPLACE;
FIG. 5 is an operation diagram of the present invention when the corrective action tag is predicted to be INSERT.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in FIGS. 1-5, a multilingual text error correction method based on a sequence labeling model includes the following steps:
step S1: corpus data collection;
selecting the top 100 languages in the Wikidata multilingual knowledge base, downloading the Wikipedia corpora of these languages, stripping the HTML markup, extracting plain text, and segmenting the corpus into sentences and words;
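By way of illustration, the cleaning and sentence splitting described above can be sketched in Python as follows; the regular-expression cleanup, the file name, and the punctuation-based splitting are simplifying assumptions rather than details given in the patent:

```python
import re

def clean_wikipedia_page(raw_html: str) -> str:
    """Strip HTML markup and keep plain text only."""
    text = re.sub(r"<[^>]+>", " ", raw_html)          # drop HTML tags
    text = re.sub(r"&[a-zA-Z#0-9]+;", " ", text)      # drop HTML entities such as &nbsp;
    return re.sub(r"\s+", " ", text).strip()          # normalize whitespace

def split_sentences(text: str) -> list[str]:
    """Rough splitting on sentence-final punctuation; production-grade multilingual
    splitting would need language-specific rules."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]

# usage: one cleaned page -> a list of plain-text sentences
with open("wiki_page.html", encoding="utf-8") as f:   # hypothetical local dump page
    sentences = split_sentences(clean_wikipedia_page(f.read()))
```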
step S2: generating data;
taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples;
the generation of negative examples includes the following ways:
(21): randomly deleting a word or character in the positive example;
(22): randomly inserting a word or character of the same language into the positive example;
(23): randomly replacing a word or character in the positive example with a word or character of the same language, the edit distance between the replaced item and the replacement being no more than 2; for example, the English "apples" is replaced with "applet" and the German "Steigen" with "Steigern", so that the generated negative examples are closer to real error patterns.
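A minimal sketch of this negative-example generation, assuming a tokenized positive sentence and a same-language vocabulary are already available; the uniform choice among the three operations and the brute-force edit-distance filter are illustrative simplifications:

```python
import random

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def corrupt_sentence(tokens: list[str], vocab: list[str], rate: float = 0.15) -> list[str]:
    """Randomly delete, insert, or replace about 15% of the tokens."""
    out = []
    for tok in tokens:
        if random.random() >= rate:
            out.append(tok)
            continue
        op = random.choice(["delete", "insert", "replace"])
        if op == "delete":
            continue                                     # (21) drop the token
        if op == "insert":
            out.extend([random.choice(vocab), tok])      # (22) random same-language token before it
        else:                                            # (23) near neighbour, edit distance <= 2
            near = [w for w in vocab if w != tok and edit_distance(w, tok) <= 2]
            out.append(random.choice(near) if near else random.choice(vocab))
    return out
```

Scanning the whole vocabulary for every replacement is slow but keeps the sketch short; a real implementation would precompute neighbour lists.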
Step S3: labeling data; generating a text error detection label and a corrective action label for each word or character of the negative example sentence according to the positive example sentence; classifying and labeling the text error detection labels and the corrective action labels respectively;
as shown in fig. 2, the classification of the text error detection labels includes a CORRECT tag and an INCORRECT tag; the CORRECT tag indicates that the word or character is correct, and the INCORRECT tag indicates that the word or character is incorrect;
the classification of corrective action labels includes a KEEP tag, an INSERT tag, a DELETE tag, and a REPLACE tag; the KEEP tag indicates that the word or character remains unchanged, the INSERT tag indicates that a word or character is inserted before it, the DELETE tag indicates that the word or character is deleted, and the REPLACE tag indicates that the word or character is replaced.
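The two label sequences can be derived by aligning the negative sentence against its positive counterpart. The sketch below uses Python's difflib for that alignment and tags the token that follows a missing word with INSERT; this is one plausible reading of the scheme rather than the patent's exact procedure:

```python
import difflib

def make_labels(negative: list[str], positive: list[str]) -> list[tuple[str, str, str]]:
    """Return (token, detection_label, correction_label) for each token of the negative example."""
    detect = ["CORRECT"] * len(negative)
    correct = ["KEEP"] * len(negative)
    matcher = difflib.SequenceMatcher(a=negative, b=positive, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":                       # token differs from the reference
            for i in range(i1, i2):
                detect[i], correct[i] = "INCORRECT", "REPLACE"
        elif op == "delete":                      # extra token in the negative example
            for i in range(i1, i2):
                detect[i], correct[i] = "INCORRECT", "DELETE"
        elif op == "insert" and i1 < len(negative):
            # a token is missing before position i1 -> the following token carries INSERT
            detect[i1], correct[i1] = "INCORRECT", "INSERT"
    return list(zip(negative, detect, correct))

# usage: "applet" is labeled INCORRECT / REPLACE against the reference "apples"
print(make_labels(["she", "likes", "applet"], ["she", "likes", "apples"]))
```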
Step S4: training the sequence labeling model on the multilingual text corpus, and outputting, for each word or character, the text error detection label type and the corrective action label type with the maximum probability; the method specifically comprises the following steps:
step S41: word segmentation of the multilingual text;
constructing a dictionary of multilingual words or characters, and segmenting the corpus with the WordPiece algorithm;
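One possible way to build such a shared multilingual WordPiece vocabulary is the Hugging Face tokenizers library; the library choice, vocabulary size, and special tokens below are assumptions, not values given in the patent:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# train a shared multilingual WordPiece vocabulary on the cleaned corpus
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=120000,                                      # illustrative size only
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["multilingual_corpus.txt"], trainer=trainer)  # one sentence per line

ids = tokenizer.encode("She likes applet .").ids            # word/character pieces -> vocabulary ids
```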
step S42: a word embedding layer;
embedding words or characters, and mapping each word or character into a 768-dimensional word vector; mapping the positions (0-59) of the words or characters to a 768-dimensional position vector;
step S43: a multi-headed attention layer;
adding the 768-dimensional word vector and the position vector, inputting the sum into a multi-head attention layer (the number of heads is 12, the number of layers is 6), and outputting a 768-dimensional hidden vector;
step S44: fully connected layers;
two fully connected layers are attached: one predicts the error detection label, taking the 768-dimensional hidden vector as input and outputting a 2-dimensional vector; the other predicts the corrective action label, taking the 768-dimensional hidden vector as input and outputting a 4-dimensional vector;
step S45: normalization;
a softmax layer is attached after the fully connected layers for normalization, and the probabilities of the error detection labels and of the corrective action labels are calculated respectively;
step S46: label output;
according to the two probability distributions, the two label types with the highest probability are output for each word or character, as shown in fig. 3.
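Steps S42 to S46 can be sketched with PyTorch as follows, using torch.nn.TransformerEncoder as a stand-in for the multi-head attention stack; the hidden size of 768, 12 heads, 6 layers, and maximum length of 60 follow the description, while the feed-forward size and the label index order are assumptions:

```python
import torch
import torch.nn as nn

class CorrectionTagger(nn.Module):
    """Shared encoder with two tag heads, as in steps S42-S46."""

    def __init__(self, vocab_size: int, max_len: int = 60, hidden: int = 768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)           # S42: 768-dim word/character vectors
        self.pos_emb = nn.Embedding(max_len, hidden)               # S42: 768-dim position vectors (0-59)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=12, dim_feedforward=4 * hidden,  # feed-forward size assumed
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)  # S43: 12 heads, 6 layers
        self.detect_head = nn.Linear(hidden, 2)                    # S44: error detection head
        self.correct_head = nn.Linear(hidden, 4)                   # S44: corrective action head

    def forward(self, token_ids: torch.Tensor):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.encoder(self.word_emb(token_ids) + self.pos_emb(positions))  # S43
        detect_prob = torch.softmax(self.detect_head(hidden), dim=-1)              # S45: softmax
        correct_prob = torch.softmax(self.correct_head(hidden), dim=-1)
        return detect_prob, correct_prob

# S46: per-token output of the most probable label of each kind
model = CorrectionTagger(vocab_size=120000)
detect_prob, correct_prob = model(torch.randint(0, 120000, (1, 60)))
detect_label = detect_prob.argmax(dim=-1)    # 0 = CORRECT, 1 = INCORRECT (index convention assumed)
correct_label = correct_prob.argmax(dim=-1)  # 0 = KEEP, 1 = INSERT, 2 = DELETE, 3 = REPLACE (assumed)
```

Except for the number of layers, these dimensions match the BERT-base configuration, so in practice the encoder could also be initialized from a pretrained multilingual model.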
Step S47: designing a loss function:
the text error correction model comprises 2 sequence labeling tasks, so that the loss function of the model is divided into an error detection loss function and a corrective action loss function as follows:
loss = loss_{detection} + loss_{correction}
loss_{detection} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_d} w_j^{d}\, y_{ij}^{d}\, \log p_{ij}^{d}
loss_{correction} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_c} w_j^{c}\, y_{ij}^{c}\, \log p_{ij}^{c}
M_d = 2
M_c = 4
where loss is the total loss function, loss_{detection} is the error detection loss function, loss_{correction} is the corrective action loss function, N is the number of words or characters of a single sentence (at most 60), M_d is the number of error detection labels, M_c is the number of corrective action labels, w_j^{d} is the weight of the class-j error detection label (w_1^{d} is the CORRECT label weight and w_2^{d} is the INCORRECT label weight), w_j^{c} is the weight of the class-j corrective action label (w_1^{c}, w_2^{c}, w_3^{c}, and w_4^{c} are the weights of the KEEP, INSERT, REPLACE, and DELETE labels, respectively), y_{ij}^{d} and y_{ij}^{c} are 1 when label j is the true label of the i-th word or character and 0 otherwise, and p_{ij}^{d} and p_{ij}^{c} are the predicted probabilities of label j for the i-th word or character.
Because only 15% of the words are perturbed during data generation, the error detection labels and corrective action labels are imbalanced: most words or characters carry the CORRECT or KEEP label, which severely affects the parameter estimation of the model. The loss-function weights of the INCORRECT label and of the INSERT, REPLACE, and DELETE labels are therefore increased to 6 and 24 respectively, reducing the influence of sample imbalance.
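A sketch of the step S47 loss with these class weights, computed directly from the per-token probabilities so that it matches the formulas above; the baseline weight of 1 for the CORRECT and KEEP labels is an assumption:

```python
import torch

# class weights: [CORRECT, INCORRECT] and [KEEP, INSERT, DELETE, REPLACE];
# 6 and 24 come from the description, the baseline weight 1 is assumed
W_DETECT = torch.tensor([1.0, 6.0])
W_CORRECT = torch.tensor([1.0, 24.0, 24.0, 24.0])

def weighted_nll(prob: torch.Tensor, target: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """-(1/N) * sum_i sum_j w_j * y_ij * log(p_ij) for one label type."""
    p_true = prob.gather(-1, target.unsqueeze(-1)).squeeze(-1).clamp_min(1e-9)
    return -(weight[target] * p_true.log()).mean()

def total_loss(detect_prob, correct_prob, detect_target, correct_target):
    """loss = loss_detection + loss_correction."""
    return (weighted_nll(detect_prob, detect_target, W_DETECT)
            + weighted_nll(correct_prob, correct_target, W_CORRECT))
```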
In training the sequence labeling model on the multilingual text corpus, the ratio of training set to test set is 97:3, with a training set size of 97,000,000 and a test set size of 3,000,000; the number of epochs is 10 and the learning rate is 5e-5. The evaluation indexes on the test set are as follows:
TABLE 1 Evaluation indexes
Label type               F1     Precision  Recall
Error detection label    0.753  0.717      0.791
Corrective action label  0.757  0.718      0.802
Step S5: after the corrective action labels are predicted, words or characters with the KEEP tag are left unchanged, and the remaining words or characters are subjected to the corresponding insertion, deletion, or replacement operation according to their labels to generate alternatives; specifically:
(1) when the corrective action tag of a word or character is predicted to be DELETE, the word or character is deleted directly;
(2) when the corrective action tag of a word or character is predicted to be REPLACE, the word or character is masked, and a masked language model generates its alternatives (as shown in FIG. 4);
(3) when the corrective action tag of a word or character is predicted to be INSERT, a mask token is added before the word or character, and a masked language model generates the alternatives (as shown in FIG. 5).
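For the REPLACE and INSERT cases, the masked language model step can be sketched with the Hugging Face fill-mask pipeline; the multilingual checkpoint named below is only an example, as the patent does not specify a particular model:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")  # assumed checkpoint
MASK = fill_mask.tokenizer.mask_token                                    # "[MASK]" for this model

def alternatives(tokens: list[str], position: int, action: str, top_k: int = 5) -> list[str]:
    """Generate candidate corrections for the token at `position` given its corrective action tag."""
    if action == "DELETE":                                   # (1) simply drop the token
        return [" ".join(tokens[:position] + tokens[position + 1:])]
    if action == "REPLACE":                                  # (2) mask the token itself
        masked = tokens[:position] + [MASK] + tokens[position + 1:]
    elif action == "INSERT":                                 # (3) add a mask before the token
        masked = tokens[:position] + [MASK] + tokens[position:]
    else:                                                    # KEEP: nothing to change
        return [" ".join(tokens)]
    return [pred["sequence"] for pred in fill_mask(" ".join(masked), top_k=top_k)]

# usage: REPLACE predicted on "applet" in "she likes applet"
print(alternatives(["she", "likes", "applet"], 2, "REPLACE"))
```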
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A multilingual text error correction method based on a sequence labeling model is characterized in that: the method comprises the following steps:
step S1: corpus data collection; selecting multiple languages from the Wikidata multilingual knowledge base, obtaining the multilingual Wikipedia corpus, stripping the HTML markup, extracting plain text, and segmenting the corpus into sentences and words;
step S2: data generation; taking the multilingual Wikipedia corpus as positive examples, segmenting the sentences into words, randomly selecting 15% of the words or characters, and deleting, inserting, or replacing them to generate negative examples;
step S3: data labeling; generating a text error detection label and a corrective action label for each word or character of the negative example sentence according to the positive example sentence; classifying and labeling the text error detection labels and the corrective action labels respectively;
step S4: training the text error correction sequence labeling model; training the sequence labeling model on the multilingual text corpus, and outputting, for each word or character, the text error detection label type and the corrective action label type with the maximum probability;
step S5: generating text error correction alternatives; after the corrective action labels are predicted, performing the corresponding insertion, deletion, or modification operation on each word or character according to its label, and generating alternatives.
2. The method according to claim 1, wherein the generation of negative examples in step S2 includes the following methods:
step S21: randomly deleting a word or character in the positive example;
step S22: randomly inserting a word or character of the same language into the positive example;
step S23: randomly replacing a word or character in the positive example with a word or character of the same language, wherein the edit distance between the replaced item and the replacement is not more than 2.
3. The method according to claim 1, wherein the classification of the text error detection labels in step S3 includes a CORRECT tag and an INCORRECT tag, where the CORRECT tag indicates that a word or character is correct, and the INCORRECT tag indicates that a word or character is incorrect;
the classification of corrective action labels includes a KEEP tag, an INSERT tag, a DELETE tag, and a REPLACE tag; the KEEP tag indicates that the word or character remains unchanged, the INSERT tag indicates that a word or character is inserted before it, the DELETE tag indicates that the word or character is deleted, and the REPLACE tag indicates that the word or character is replaced.
4. The method according to claim 3, wherein step S4 specifically includes the following steps:
step S41: constructing a dictionary of multilingual words or characters, and segmenting the corpus with the WordPiece algorithm;
step S42: embedding the words or characters, mapping each word or character to a 768-dimensional word vector and its position to a 768-dimensional position vector;
step S43: adding the 768-dimensional word vector and position vector, feeding the sum into a multi-head attention layer, and outputting a 768-dimensional hidden vector;
step S44: attaching two fully connected layers: one predicts the error detection label, taking the 768-dimensional hidden vector as input and outputting a 2-dimensional vector; the other predicts the corrective action label, taking the 768-dimensional hidden vector as input and outputting a 4-dimensional vector;
step S45: attaching a softmax layer after the fully connected layers for normalization, and computing the probabilities of the error detection labels and of the corrective action labels respectively;
step S46: according to the two probability distributions, outputting for each word or character the two label types with the highest probability.
5. The method according to claim 4, wherein step S4 further includes:
step S47: designing a loss function:
the text error correction model comprises 2 sequence labeling tasks, so that the loss function of the model is divided into an error detection loss function and a corrective action loss function as follows:
loss = loss_{detection} + loss_{correction}
loss_{detection} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_d} w_j^{d}\, y_{ij}^{d}\, \log p_{ij}^{d}
loss_{correction} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_c} w_j^{c}\, y_{ij}^{c}\, \log p_{ij}^{c}
M_d = 2
M_c = 4
wherein loss is the total loss function, loss_{detection} is the error detection loss function, loss_{correction} is the corrective action loss function, N is the number of words or characters of a single sentence (at most 60), M_d is the number of error detection labels, M_c is the number of corrective action labels, w_j^{d} is the weight of the class-j error detection label (w_1^{d} is the CORRECT label weight and w_2^{d} is the INCORRECT label weight), w_j^{c} is the weight of the class-j corrective action label (w_1^{c}, w_2^{c}, w_3^{c}, and w_4^{c} are the weights of the KEEP, INSERT, REPLACE, and DELETE labels, respectively), y_{ij}^{d} and y_{ij}^{c} are 1 when label j is the true label of the i-th word or character and 0 otherwise, and p_{ij}^{d} and p_{ij}^{c} are the predicted probabilities of label j for the i-th word or character.
6. The method according to claim 5, wherein in step S5, after the corrective action tags are predicted, words or characters with the KEEP tag are left unchanged, and the remaining words or characters are subjected to the corresponding insertion, deletion, or modification operation according to their tags to generate alternatives; specifically:
(1) when the corrective action tag of a word or character is predicted to be DELETE, the word or character is deleted directly;
(2) when the corrective action tag of a word or character is predicted to be REPLACE, the word or character is masked, and a masked language model generates its alternatives;
(3) when the corrective action tag of a word or character is predicted to be INSERT, a mask token is added before the word or character, and a masked language model generates the alternatives.
CN202210023205.6A 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model Pending CN114372463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210023205.6A CN114372463A (en) 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210023205.6A CN114372463A (en) 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model

Publications (1)

Publication Number Publication Date
CN114372463A true CN114372463A (en) 2022-04-19

Family

ID=81144094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210023205.6A Pending CN114372463A (en) 2022-01-10 2022-01-10 Multi-language text error correction method based on sequence labeling model

Country Status (1)

Country Link
CN (1) CN114372463A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925170A (en) * 2022-05-25 2022-08-19 人民网股份有限公司 Text proofreading model training method and device and computing equipment
CN114925170B (en) * 2022-05-25 2023-04-07 人民网股份有限公司 Text proofreading model training method and device and computing equipment

Similar Documents

Publication Publication Date Title
CN110110327B (en) Text labeling method and equipment based on counterstudy
JP5356197B2 (en) Word semantic relation extraction device
CN108446271B (en) Text emotion analysis method of convolutional neural network based on Chinese character component characteristics
CN110134949B (en) Text labeling method and equipment based on teacher supervision
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
KR101813683B1 (en) Method for automatic correction of errors in annotated corpus using kernel Ripple-Down Rules
CN108628828A (en) A kind of joint abstracting method of viewpoint and its holder based on from attention
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
US11232358B1 (en) Task specific processing of regulatory content
CN105988990A (en) Device and method for resolving zero anaphora in Chinese language, as well as training method
Singh et al. A decision tree based word sense disambiguation system in Manipuri language
CN110110334B (en) Remote consultation record text error correction method based on natural language processing
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
CN110147546B (en) Grammar correction method and device for spoken English
CN112287100A (en) Text recognition method, spelling error correction method and voice recognition method
CN109213998A (en) Chinese wrongly written character detection method and system
CN116306600B (en) MacBert-based Chinese text error correction method
CN113657098A (en) Text error correction method, device, equipment and storage medium
CN113448843A (en) Defect analysis-based image recognition software test data enhancement method and device
Boroş et al. A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
CN113312918B (en) Word segmentation and capsule network law named entity identification method fusing radical vectors
CN114372463A (en) Multi-language text error correction method based on sequence labeling model
CN110674642A (en) Semantic relation extraction method for noisy sparse text
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN115688703B (en) Text error correction method, storage medium and device in specific field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination