CN115204151A - Chinese text error correction method, system and readable storage medium - Google Patents

Chinese text error correction method, system and readable storage medium Download PDF

Info

Publication number
CN115204151A
Authority
CN
China
Prior art keywords
target
word
character
error correction
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211118545.3A
Other languages
Chinese (zh)
Inventor
王鹏鸣
郝书乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202211118545.3A priority Critical patent/CN115204151A/en
Publication of CN115204151A publication Critical patent/CN115204151A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese text error correction method, system and readable storage medium. The method first divides an original Chinese text into paragraphs to obtain a target paragraph and forms a sentence vector from it. The sentence vector is then input into a Bert model; the strong reading-comprehension capability of the Bert model allows the sentence meaning to be fully understood, which addresses the problems of numerous error types and long-distance dependence. A preset number of top-ranked predicted character codes are obtained and a predicted character set is generated; a candidate character set is then generated in combination with a preset shape-and-sound-similarity dictionary; finally, a character is selected from the intersection of the predicted character set and the candidate character set to replace the target character.

Description

Chinese text error correction method, system and readable storage medium
Technical Field
The invention relates to the technical field of word processing, in particular to a Chinese text error correction method, a Chinese text error correction system and a readable storage medium.
Background
With the wide application of pinyin input methods and speech recognition, more and more Chinese text errors arise from phonetic confusion, while handwriting input methods additionally produce many errors between visually similar characters. Correcting Chinese text is therefore a challenging problem.
The problems currently faced by Chinese text error correction are as follows. (1) Many error types: under the currently popular grammatical-error annotation systems, grammatical errors are divided into more than fifty types, and under some grammar-teaching systems they can even be divided into more than a hundred. With such complicated error types, learning targets are difficult to unify and error correction accuracy is low. (2) Long-distance dependence: taking subject-predicate agreement as an example, the subject needed to judge the agreement of a predicate is sometimes far from that predicate; such long-distance agreement cases are sparse in training corpora, and learning long-distance information is precisely a difficulty for machine learning.
Disclosure of Invention
Therefore, the embodiment of the invention provides a Chinese text error correction method, a Chinese text error correction system and a readable storage medium, so as to improve the error correction accuracy and avoid the problem of long-distance dependence.
The embodiment of the invention provides a Chinese text error correction method, which comprises the following steps:
acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and using an occlusion mark to occlude a target word in the target paragraph to obtain an occluded target paragraph;
coding the occlusion target paragraph to form a sentence vector;
inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating a score value of each predicted character code;
sorting the score values of the predicted character codes according to the order of the score values from high to low, and acquiring a preset number of predicted character codes which are sorted in the front;
decoding the predictive character codes with the preset number and the front sequence to obtain the corresponding predictive characters with the preset number and form a predictive character set;
judging whether the character of the target character exists in the predicted character set or not;
if the character of the target character does not exist in the predicted character set, generating a candidate character set by combining a preset shape-similar-sound dictionary according to the character of the target character;
generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
According to the Chinese text error correction method provided by the embodiment of the invention, the original Chinese text is first divided into paragraphs to obtain a target paragraph, from which a sentence vector is formed. The sentence vector is then input into a Bert model; the strong reading-comprehension capability of the Bert model allows the sentence meaning to be fully understood, which addresses the problems of numerous error types and long-distance dependence. After the top-ranked preset number of predicted character codes are obtained and a predicted character set is generated, a candidate character set is generated in combination with a preset shape-and-sound-similarity dictionary, and finally a character is selected from the intersection of the predicted character set and the candidate character set to replace the target character. Since most errors are caused by similar shapes and similar pronunciations, using this intersection as the basis for error correction can effectively improve the accuracy of error correction.
In addition, the method for correcting the chinese text according to the above embodiment of the present invention may further have the following additional technical features:
further, in the second step, the step of encoding the occlusion target paragraph to form a sentence vector specifically includes:
coding the occlusion target paragraph by using a token function to form a sentence vector, and during coding, carrying out attention labeling on each token sequence;
the sentence vector $E$ is expressed as:

$$E = \mathrm{Encoder}(T), \qquad T = [t_1, t_2, \cdots, t_i, \cdots, t_N]$$

wherein $\mathrm{Encoder}(\cdot)$ represents the encoder operation, $t_i$ represents the token value of the $i$-th word $x_i$, $t_N$ represents the token value of the $N$-th word $x_N$, and $N$ represents the total number of words.
Further, in the third step, in the step of inputting the sentence vector into the Bert model and predicting a plurality of predicted characters at the position of the occlusion mark according to the context of the target paragraph, the preceding and following language features are extracted by a self-attention mechanism, and the corresponding formulas are:

$$Q = W_Q E, \qquad K = W_K E, \qquad V = W_V E$$

wherein $Q$ represents the first feature vector of the Attention weight (the query), $K$ represents the second feature vector of the Attention weight (the key), $V$ represents the vector of input features (the value), $W_Q$ represents the weight of the first feature vector, $W_K$ represents the weight of the second feature vector, $W_V$ represents the weight of the vector of input features, and $E$ represents the sentence vector.
Further, in the third step, $Q$, $K$ and $V$ are linearly transformed using multiple sets of parameters, and the results of all attention heads are combined to obtain the result of the self-attention mechanism. The corresponding formulas are:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

wherein $\mathrm{MultiHead}(Q, K, V)$ represents the multi-head self-attention operation on $Q$, $K$ and $V$, $\mathrm{Concat}$ represents the merging of the multiple attention heads, $W^O$ represents the output weight, $\mathrm{head}_i$ represents the attention head corresponding to the $i$-th group, $W_i^Q$ represents the weight of the first feature vector of the $i$-th group, $W_i^K$ represents the weight of the second feature vector of the $i$-th group, $W_i^V$ represents the weight of the vector of input features of the $i$-th group, and $\mathrm{Attention}$ represents the corresponding attention function.
Further, in the third step, in the step of calculating the score value of each predicted character code, the score value of the predicted character code is calculated using the following formula:

$$s = \mathrm{softmax}\big(h_{[\mathrm{MASK}]} W_e^{\mathsf T}\big)$$

wherein $s$ represents the score value of the predicted character code, $h_{[\mathrm{MASK}]}$ represents the encoder output at the occluded position, $W_e$ represents the output word-embedding matrix, and $\mathsf{T}$ represents the transpose operation.
Further, after the step of determining whether the character of the target word exists in the predicted character set, the method further includes:
if the character of the target character exists in the predicted character set, judging that the target character does not need error correction;
after the step of generating an error correction set according to the predicted character set and the candidate character set, the method further comprises:
determining whether the error correction set is empty;
if the error correction set is empty, judging that the target word does not need error correction;
and if the error correction set is not empty, selecting one word in the error correction set to replace the target word.
Further, the step of generating a candidate character set in combination with a preset shape-and-sound-similarity dictionary according to the characters of the target word specifically includes:
searching a plurality of candidate characters from the preset shape-and-sound-similarity dictionary by using an edit distance algorithm according to the characters of the target word, and generating the candidate character set.
Further, the Bert model includes a first-part loss function and a second-part loss function, and the first-part loss function and the second-part loss function form a total loss function.
The first-part loss function is expressed as:

$$L_1(\theta, \theta_1) = -\sum_{i=1}^{M} \log p(m = m_i \mid \theta, \theta_1), \quad m_i \in [1, 2, \cdots, |V|]$$

wherein $L_1(\theta, \theta_1)$ represents the first-part loss function of the Bert model, $M$ represents the maximum number of words in the set formed by the masked words, $\theta$ represents the parameters of the Encoder part of the Bert model, $\theta_1$ represents the parameters of the output layer connected to the Encoder in the Mask-LM task, $|V|$ represents the size of the dictionary, $m_i$ represents the $i$-th word in the word set, and $p(m = m_i \mid \theta, \theta_1)$ represents the score probability corresponding to the $i$-th word.
The second-part loss function is correspondingly expressed as:

$$L_2(\theta, \theta_2) = -\sum_{j=1}^{N} \log p(n = n_j \mid \theta, \theta_2), \quad n_j \in [\mathrm{IsNext}, \mathrm{NotNext}]$$

wherein $L_2(\theta, \theta_2)$ represents the second-part loss function of the Bert model, $\theta_2$ represents the parameters of the classifier connected after the Encoder in the sentence prediction task, $n_j$ represents the relationship corresponding to the $j$-th sentence pair, $[\mathrm{IsNext}, \mathrm{NotNext}]$ represents the two relationship kinds, $\mathrm{IsNext}$ indicates that two adjacent sentences are related, $\mathrm{NotNext}$ indicates that two adjacent sentences are not related, and $p(n = n_j \mid \theta, \theta_2)$ represents the corresponding score probability.
The total loss function is expressed as:

$$L(\theta, \theta_1, \theta_2) = L_1(\theta, \theta_1) + L_2(\theta, \theta_2)$$

wherein $L(\theta, \theta_1, \theta_2)$ represents the total loss function of the Bert model.
The invention also provides a Chinese text error correction system, wherein the system comprises:
the target shielding module is used for acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding a target word in the target paragraph by using a shielding mark to obtain a shielded target paragraph;
the target coding module is used for coding the shielded target paragraphs to form sentence vectors;
the coding prediction module is used for inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating the score of each predicted character code;
the score sorting module is used for sorting the score values of all the predictive character codes according to the order of the score values from high to low and acquiring the predictive character codes with the preset number which are sorted in the front;
the encoding and decoding module is used for decoding the predicted character codes with the preset number and the front order to obtain the corresponding predicted characters with the preset number and form a predicted character set;
the character judgment module is used for judging whether the character of the target character exists in the predicted character set or not;
the first generation module is used for generating a candidate character set in combination with a preset shape-similarity dictionary according to the characters of the target characters if the characters of the target characters do not exist in the predicted character set;
the second generation module is used for generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
the target replacement module is used for, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
The invention also proposes a readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the Chinese text error correction method as described above.
Drawings
The above and/or additional aspects and advantages of embodiments of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for correcting errors in Chinese text according to an embodiment of the present invention;
FIG. 2 is a block diagram of a Chinese text error correction system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Referring to FIG. 1, a Chinese text error correction method according to an embodiment of the present invention includes the following steps S1 to S9:
s1, acquiring an input original Chinese text, carrying out paragraph division on the original Chinese text to obtain a target paragraph, and using an occlusion mark to occlude a target word in the target paragraph to obtain an occluded target paragraph.
Because the Bert model has good performance on extracting feature words of long texts and understanding sentence meanings, the original Chinese text is segmented according to paragraphs and then transmitted into the Bert model, so that the Bert model can fully understand the sentence meanings.
In this embodiment, the occlusion mark is represented by [MASK]. For example, for the sentence "他是一位帝王" ("he is a monarch"), occluding the target character "帝" yields "他是一位[MASK]王", and the character at the [MASK] position is then predicted.
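As an illustrative sketch only, the masking-and-prediction flow described here can be reproduced with the open-source Hugging Face transformers library; the patent names no implementation, so the "bert-base-chinese" checkpoint and the example sentence below are assumptions:

```python
# Hypothetical sketch of steps S1/S3: occlude one character with [MASK]
# and let a masked language model score every vocabulary entry for it.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "他是一位[MASK]王"  # the target character "帝" is occluded
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# locate the [MASK] position and read out its vocabulary scores
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
scores = logits[0, mask_pos.item()]
print(tokenizer.convert_ids_to_tokens(scores.topk(5).indices.tolist()))
```

The top-5 list printed at the end corresponds to the preset number of top-ranked predicted character codes that steps S4 and S5 below decode into the predicted character set.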
In this step, the target paragraph after paragraph division is expressed in the form of a word list:

$$S = [s_1, s_2, \cdots, s_i, \cdots, s_n]$$

wherein $s_i$ represents the $i$-th word in the target paragraph, $s_n$ represents the $n$-th word in the target paragraph, $i \in [1, n]$, and $n$ represents the maximum number of words in the target paragraph.
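As a minimal sketch of this division step, assuming a simple newline-based paragraph split (the patent does not fix the splitting rule), the word-list representation above can be produced as follows:

```python
# Hypothetical paragraph division for step S1: split the raw text into
# paragraphs, then represent each paragraph as a list of characters,
# i.e. S = [s_1, s_2, ..., s_n].
def divide_paragraphs(text: str) -> list[list[str]]:
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return [list(p) for p in paragraphs]

doc = "他是一位帝王。\n他十分勇敢。"
for S in divide_paragraphs(doc):
    print(S)  # e.g. ['他', '是', '一', '位', '帝', '王', '。']
```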
And S2, coding the occlusion target paragraph to form a sentence vector.
Inputting raw characters directly into the model would make the computation too heavy; instead, each character is numerically encoded according to the dictionary, which greatly reduces the amount of computation when the input is fed into the model.
Specifically, the occlusion target paragraphs are encoded by using a token function to form a sentence vector, and during encoding, attention labeling is performed on each token sequence.
Since the Bert model can only process input sequences of up to 512 tokens, an input shorter than 512 tokens is padded with zeros. In this embodiment, attention labeling is performed on each token sequence during encoding; that is, besides the token sequence, the Bert model also takes an attention_mask sequence. If a token corresponds to a character in the text, its attention_mask is labeled 1, indicating that the token should be attended to during model computation; for the padded zeros, the attention_mask is labeled 0, and these meaningless tokens are not considered in the model computation.
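A short sketch of this padding-and-attention-labeling behaviour, again using the transformers tokenizer as an assumed implementation:

```python
# Pad the encoded paragraph to the 512-token limit; attention_mask is 1
# for real text tokens and 0 for the meaningless padding tokens.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
enc = tokenizer("他是一位[MASK]王", padding="max_length", max_length=512,
                truncation=True)

print(enc["input_ids"][:10])        # token codes of the sentence vector
print(enc["attention_mask"][:10])   # 1 = attend to this token, 0 = padding
print(sum(enc["attention_mask"]))   # number of tokens the model attends to
```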
In this step, the sentence vector $E$ is expressed as:

$$E = \mathrm{Encoder}(T), \qquad T = [t_1, t_2, \cdots, t_i, \cdots, t_N]$$

wherein $\mathrm{Encoder}(\cdot)$ represents the encoder operation, $t_i$ represents the token value of the $i$-th word $x_i$, $t_N$ represents the token value of the $N$-th word $x_N$, and $N$ represents the total number of words.
And S3, inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of the shielding mark according to the context of the target paragraph, and calculating the score of each predicted character code.
In this step, the preceding and following language features are extracted using a self-attention mechanism. Specifically, the calculation formulas are as follows:

$$Q = W_Q E, \qquad K = W_K E, \qquad V = W_V E$$

wherein $Q$ represents the first feature vector of the Attention weight (the query), $K$ represents the second feature vector of the Attention weight (the key), $V$ represents the vector of input features (the value), $W_Q$ represents the weight of the first feature vector, $W_K$ represents the weight of the second feature vector, $W_V$ represents the weight of the vector of input features, and $E$ represents the sentence vector.
For the above $Q$, $K$ and $V$, linear transformations are performed using multiple sets of parameters; the final result of the self-attention mechanism is the combination of the results of all attention heads:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

wherein $\mathrm{MultiHead}(Q, K, V)$ represents the multi-head self-attention operation on $Q$, $K$ and $V$, $\mathrm{Concat}$ represents the merging of the multiple attention heads, $W^O$ represents the output weight, $\mathrm{head}_i$ represents the attention head corresponding to the $i$-th group, $W_i^Q$ represents the weight of the first feature vector of the $i$-th group, $W_i^K$ represents the weight of the second feature vector of the $i$-th group, $W_i^V$ represents the weight of the vector of input features of the $i$-th group, and $\mathrm{Attention}$ represents the corresponding attention function.
Additionally, in the present embodiment, the scale factor $\sqrt{d_k}$ is added to the self-attention mechanism, and the calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf T}}{\sqrt{d_k}}\right) V$$

wherein $\mathrm{softmax}$ represents the normalization function operation and $\mathsf{T}$ represents the transpose operation.
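The following sketch implements the scaled dot-product self-attention formula above for a single head; the dimensions (d_model = 768, d_k = 64) are illustrative assumptions rather than values fixed by the patent:

```python
# One-head scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V.
import torch
import torch.nn.functional as F

def self_attention(E, W_q, W_k, W_v):
    Q, K, V = E @ W_q, E @ W_k, E @ W_v      # Q = W_Q E etc., in matrix form
    d_k = K.size(-1)
    weights = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ V

torch.manual_seed(0)
seq_len, d_model, d_k = 6, 768, 64
E = torch.randn(seq_len, d_model)             # sentence vector after encoding
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(E, W_q, W_k, W_v).shape)  # torch.Size([6, 64])
```

In the multi-head case, this computation is repeated with h independent weight groups and the head outputs are concatenated, as in the MultiHead formula given earlier.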
Further, in this step, the score value of the predicted character code is calculated using the following formula:

$$s = \mathrm{softmax}\big(h_{[\mathrm{MASK}]} W_e^{\mathsf T}\big)$$

wherein $s$ represents the score value of the predicted character code, $h_{[\mathrm{MASK}]}$ represents the encoder output at the occluded position, and $W_e$ represents the output word-embedding matrix.
Still taking "他是一位[MASK]王" as an example: when predicting the character at the [MASK] position, the prediction may at first be inaccurate. Even though "帝" may be among the predictions, to the Bert model it may appear no more likely than characters such as "地", "天" or "冥". However, if the preceding sentence describes how brave this person was, that long-distance context disambiguates the prediction, and the score value of "帝" calculated by the above formula becomes far higher than the score values of "地", "天", "冥" and other candidates.
And S4, sorting the score values of the predictive character codes according to the order of the score values from high to low, and acquiring the predictive character codes with the preset number, which are sorted in the front, of the predictive character codes.
Preferably, the predetermined number is at least 5.
And S5, decoding the predictive character codes with the preset number, which are ranked at the top, to obtain the corresponding predictive characters with the preset number, and forming a predictive character set.
In the present invention, the loss function of the Bert model consists of two parts: the first part comes from the word-level classification task of Mask-LM, and the other part is a sentence-level classification task. Through the joint learning of these two tasks, the representation learned by Bert carries token-level information and also contains sentence-level semantic information.
Specifically, for the first-part loss function, the masked words form a set, and each masked word corresponds to one of the $|V|$ dictionary entries. The first-part loss function is expressed as:

$$L_1(\theta, \theta_1) = -\sum_{i=1}^{M} \log p(m = m_i \mid \theta, \theta_1), \quad m_i \in [1, 2, \cdots, |V|]$$

wherein $L_1(\theta, \theta_1)$ represents the first-part loss function of the Bert model, $M$ represents the maximum number of words in the set formed by the masked words, $\theta$ represents the parameters of the Encoder part of the Bert model, $\theta_1$ represents the parameters of the output layer connected to the Encoder in the Mask-LM task, $|V|$ represents the size of the dictionary, $m_i$ represents the $i$-th word in the word set, and $p(m = m_i \mid \theta, \theta_1)$ represents the score probability corresponding to the $i$-th word.

Further, the sentence prediction task is a classification problem with the second-part loss function, expressed as:

$$L_2(\theta, \theta_2) = -\sum_{j=1}^{N} \log p(n = n_j \mid \theta, \theta_2), \quad n_j \in [\mathrm{IsNext}, \mathrm{NotNext}]$$

wherein $L_2(\theta, \theta_2)$ represents the second-part loss function of the Bert model, $\theta_2$ represents the parameters of the classifier connected after the Encoder in the sentence prediction task, $n_j$ represents the relationship corresponding to the $j$-th sentence pair, $[\mathrm{IsNext}, \mathrm{NotNext}]$ represents the two relationship kinds, $\mathrm{IsNext}$ indicates that two adjacent sentences are related, $\mathrm{NotNext}$ indicates that two adjacent sentences are not related, and $p(n = n_j \mid \theta, \theta_2)$ represents the corresponding score probability.

Thus, for the Bert model, the total loss function of the joint learning of the two tasks is expressed as:

$$L(\theta, \theta_1, \theta_2) = L_1(\theta, \theta_1) + L_2(\theta, \theta_2)$$

wherein $L(\theta, \theta_1, \theta_2)$ represents the total loss function of the Bert model.
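A hedged sketch of this two-part objective with random tensors standing in for real model outputs; the vocabulary size 21128 matches the common bert-base-chinese checkpoint and is an assumption here:

```python
# Joint loss L = L1 (masked-LM cross-entropy) + L2 (next-sentence prediction).
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 21128, 16, 2

mlm_logits = torch.randn(batch, seq_len, vocab_size)   # MLM head output
mlm_labels = torch.randint(0, vocab_size, (batch, seq_len))
mlm_labels[:, 4:] = -100               # only masked positions contribute to L1

nsp_logits = torch.randn(batch, 2)     # IsNext / NotNext classifier output
nsp_labels = torch.tensor([0, 1])      # 0 = IsNext, 1 = NotNext

loss_mlm = F.cross_entropy(mlm_logits.view(-1, vocab_size),
                           mlm_labels.view(-1), ignore_index=-100)
loss_nsp = F.cross_entropy(nsp_logits, nsp_labels)
total_loss = loss_mlm + loss_nsp       # L(theta, theta_1, theta_2) = L1 + L2
print(total_loss.item())
```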
And S6, judging whether the character of the target character exists in the predicted character set.
It can be understood that if the character of the target word exists in the predicted character set, it is determined that the target word does not need error correction, and then the other words except the target word are shielded and the flow of steps S1 to S6 is repeated.
And S7, if the character of the target character does not exist in the predicted character set, generating a candidate character set by combining a preset shape-similar-sound dictionary according to the character of the target character.
Specifically, according to the characters of the target word, a plurality of candidate characters are searched from the preset shape-and-sound-similarity dictionary by using an edit distance algorithm, and the candidate character set is generated.
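One way this lookup might be realized is sketched below; the Levenshtein routine is the standard dynamic-programming algorithm, while the tiny shape/sound dictionary and its pinyin-like keys are invented placeholders for illustration:

```python
# Hypothetical candidate generation for step S7 via edit distance over a
# preset shape-and-sound-similarity dictionary.
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance with a rolling one-row table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# keys are pinyin-like codes; values are characters similar in shape or sound
shape_sound_dict = {"di4": ["帝", "地", "弟"], "tian1": ["天", "田"]}

def candidate_chars(target_code: str, max_dist: int = 1) -> set:
    cands = set()
    for code, chars in shape_sound_dict.items():
        if edit_distance(target_code, code) <= max_dist:
            cands.update(chars)
    return cands

print(candidate_chars("di4"))  # {'帝', '地', '弟'}
```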
And S8, generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set.
S9, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result.
Wherein, the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
As a specific example, after the step of generating an error correction set according to the predicted character set and the candidate character set, the method further includes:
determining whether the error correction set is empty;
if the error correction set is empty, judging that the target word does not need error correction;
and if the error correction set is not empty, selecting one word in the error correction set to replace the target word.
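The selection among surviving candidates might be sketched as follows; the masked pseudo-log-probability used here (higher means more fluent, i.e. lower perplexity) is one reading of the confusion-score criterion above, and the helper names and example sets are assumptions:

```python
# Steps S8/S9: intersect the predicted and candidate sets, then score each
# repaired sentence with BERT by masking every character in turn.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()

def sentence_log_prob(sentence: str) -> float:
    """Sum of log P(w_i | context) over all characters of the sentence."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    total = 0.0
    for pos in range(1, ids.size(1) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        true_id = masked[0, pos].item()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        total += torch.log_softmax(logits[0, pos], -1)[true_id].item()
    return total

predictions = {"帝", "天"}                 # predicted character set (example)
candidates = {"帝", "地"}                  # candidate character set (example)
correction_set = predictions & candidates  # the error correction set
template = "他是一位{}王"
if not correction_set:
    best = None                            # empty set: no correction needed
elif len(correction_set) == 1:
    best = next(iter(correction_set))
else:
    best = max(correction_set,
               key=lambda c: sentence_log_prob(template.format(c)))
print(best)
```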
In summary, according to the Chinese text error correction method provided in this embodiment, the original Chinese text is first divided into paragraphs to obtain a target paragraph, from which a sentence vector is formed; the sentence vector is then input into a Bert model, whose strong reading-comprehension capability allows the sentence meaning to be fully understood, thereby addressing the problems of numerous error types and long-distance dependence.
Referring to fig. 2, the present invention further provides a chinese text error correction system, wherein the system includes:
the target shielding module is used for acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding target words in the target paragraph by using shielding marks to obtain a shielded target paragraph;
the target coding module is used for coding the shielding target paragraph to form a sentence vector;
the coding prediction module is used for inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating the score of each predicted character code;
the score sorting module is used for sorting the score values of all the predicted character codes according to the order of the score values from high to low and acquiring the preset number of the predicted character codes which are sorted in the front;
the encoding and decoding module is used for decoding the predicted character codes with the preset number which are ranked in the front to obtain the corresponding predicted characters with the preset number and form a predicted character set;
the character judgment module is used for judging whether the character of the target character exists in the predicted character set or not;
the first generation module is used for generating a candidate character set in combination with a preset shape-similarity dictionary according to the characters of the target characters if the characters of the target characters do not exist in the predicted character set;
the second generation module is used for generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
the target replacement module is used for, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
The invention also proposes a readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the Chinese text error correction method as described above.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A Chinese text error correction method is characterized by comprising the following steps:
step one, acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding target words in the target paragraph by using a shielding mark to obtain a shielded target paragraph;
step two, encoding the occlusion target paragraph to form a sentence vector;
inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating a score of each predicted character code;
sorting score values of all the predictive character codes according to the order of the score values from high to low, and acquiring a preset number of predictive character codes which are sorted in the front;
decoding the predictive character codes with the preset number and the front sequence to obtain the corresponding predictive characters with the preset number and form a predictive character set;
step six, judging whether the character of the target character exists in the predicted character set or not;
step seven, if the character of the target character does not exist in the predicted character set, generating a candidate character set by combining a preset shape-similar-sound dictionary according to the character of the target character;
step eight, generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
step nine, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
2. The method for correcting errors in chinese text according to claim 1, wherein in the second step, the step of encoding the occlusion target paragraphs to form sentence vectors specifically comprises:
coding the occlusion target paragraph by using a token function to form a sentence vector, and during coding, carrying out attention labeling on each token sequence;
the sentence vector $E$ is expressed as:

$$E = \mathrm{Encoder}(T), \qquad T = [t_1, t_2, \cdots, t_i, \cdots, t_N]$$

wherein $\mathrm{Encoder}(\cdot)$ represents the encoder operation, $t_i$ represents the token value of the $i$-th word $x_i$, $t_N$ represents the token value of the $N$-th word $x_N$, and $N$ represents the total number of words.
3. The Chinese text error correction method according to claim 2, wherein in the third step, in the step of inputting the sentence vector into the Bert model and predicting a plurality of predicted characters at the position of the occlusion mark according to the context of the target paragraph, the preceding and following language features are extracted by a self-attention mechanism, and the corresponding formulas are:

$$Q = W_Q E, \qquad K = W_K E, \qquad V = W_V E$$

wherein $Q$ represents the first feature vector of the Attention weight, $K$ represents the second feature vector of the Attention weight, $V$ represents the vector of input features, $W_Q$ represents the weight of the first feature vector, $W_K$ represents the weight of the second feature vector, $W_V$ represents the weight of the vector of input features, and $E$ represents the sentence vector.
4. The Chinese text error correction method according to claim 3, wherein in the third step, $Q$, $K$ and $V$ are linearly transformed using multiple sets of parameters to obtain the result of the self-attention mechanism, and the corresponding formulas are:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

wherein $\mathrm{MultiHead}(Q, K, V)$ represents the multi-head self-attention operation on $Q$, $K$ and $V$, $\mathrm{Concat}$ represents the merging of the multiple attention heads, $W^O$ represents the output weight, $\mathrm{head}_i$ represents the attention head corresponding to the $i$-th group, $W_i^Q$ represents the weight of the first feature vector of the $i$-th group, $W_i^K$ represents the weight of the second feature vector of the $i$-th group, $W_i^V$ represents the weight of the vector of input features of the $i$-th group, and $\mathrm{Attention}$ represents the corresponding attention function.
5. The Chinese text error correction method according to claim 4, wherein in the third step, in the step of calculating the score value of each predicted character code, the score value of the predicted character code is calculated using the following formula:

$$s = \mathrm{softmax}\big(h_{[\mathrm{MASK}]} W_e^{\mathsf T}\big)$$

wherein $s$ represents the score value of the predicted character code, $h_{[\mathrm{MASK}]}$ represents the encoder output at the occluded position, $W_e$ represents the output word-embedding matrix, and $\mathsf{T}$ represents the transpose operation.
6. The chinese text error correction method of claim 1, wherein after the step of determining whether the character of the target word is present in the predicted character set, the method further comprises:
if the character of the target character exists in the predicted character set, judging that the target character does not need error correction;
after the step of generating an error correction set according to the predicted character set and the candidate character set, the method further comprises:
determining whether the error correction set is empty;
if the error correction set is empty, judging that the target word does not need error correction;
and if the error correction set is not empty, selecting one word in the error correction set to replace the target word.
7. The Chinese text error correction method according to claim 1, wherein the step of generating a candidate character set in combination with a preset shape-and-sound-similarity dictionary according to the characters of the target word specifically comprises:
searching a plurality of candidate characters from the preset shape-and-sound-similarity dictionary by using an edit distance algorithm according to the characters of the target word, and generating the candidate character set.
8. The Chinese text error correction method according to claim 1, wherein the Bert model includes a first-part loss function and a second-part loss function, and the first-part loss function and the second-part loss function form a total loss function;
the first-part loss function is expressed as:

$$L_1(\theta, \theta_1) = -\sum_{i=1}^{M} \log p(m = m_i \mid \theta, \theta_1), \quad m_i \in [1, 2, \cdots, |V|]$$

wherein $L_1(\theta, \theta_1)$ represents the first-part loss function of the Bert model, $M$ represents the maximum number of words in the set formed by the masked words, $\theta$ represents the parameters of the Encoder part of the Bert model, $\theta_1$ represents the parameters of the output layer connected to the Encoder in the Mask-LM task, $|V|$ represents the size of the dictionary, $m_i$ represents the $i$-th word in the word set, and $p(m = m_i \mid \theta, \theta_1)$ represents the score probability corresponding to the $i$-th word;
the second-part loss function is correspondingly expressed as:

$$L_2(\theta, \theta_2) = -\sum_{j=1}^{N} \log p(n = n_j \mid \theta, \theta_2), \quad n_j \in [\mathrm{IsNext}, \mathrm{NotNext}]$$

wherein $L_2(\theta, \theta_2)$ represents the second-part loss function of the Bert model, $\theta_2$ represents the parameters of the classifier connected after the Encoder in the sentence prediction task, $n_j$ represents the relationship corresponding to the $j$-th sentence pair, $[\mathrm{IsNext}, \mathrm{NotNext}]$ represents the two relationship kinds, $\mathrm{IsNext}$ indicates that two adjacent sentences are related, $\mathrm{NotNext}$ indicates that two adjacent sentences are not related, and $p(n = n_j \mid \theta, \theta_2)$ represents the corresponding score probability;
the total loss function is expressed as:

$$L(\theta, \theta_1, \theta_2) = L_1(\theta, \theta_1) + L_2(\theta, \theta_2)$$

wherein $L(\theta, \theta_1, \theta_2)$ represents the total loss function of the Bert model.
9. A chinese text correction system, the system comprising:
the target shielding module is used for acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding target words in the target paragraph by using shielding marks to obtain a shielded target paragraph;
the target coding module is used for coding the shielding target paragraph to form a sentence vector;
the coding prediction module is used for inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating the score of each predicted character code;
the score sorting module is used for sorting the score values of all the predicted character codes according to the order of the score values from high to low and acquiring the preset number of the predicted character codes which are sorted in the front;
the encoding and decoding module is used for decoding the predicted character codes with the preset number which are ranked in the front to obtain the corresponding predicted characters with the preset number and form a predicted character set;
the character judgment module is used for judging whether the character of the target character exists in the predicted character set or not;
the first generation module is used for generating a candidate character set according to the characters of the target words and in combination with a preset shape-similarity dictionary if the characters of the target words do not exist in the predicted character set;
the second generation module is used for generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
the target replacement module is used for, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
10. A readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the chinese text error correction method of any one of claims 1 to 8.
CN202211118545.3A 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium Pending CN115204151A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118545.3A CN115204151A (en) 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211118545.3A CN115204151A (en) 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium

Publications (1)

Publication Number Publication Date
CN115204151A true CN115204151A (en) 2022-10-18

Family

ID=83571789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118545.3A Pending CN115204151A (en) 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium

Country Status (1)

Country Link
CN (1) CN115204151A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101010A (en) * 2020-11-23 2020-12-18 中博信息技术研究院有限公司 Telecom industry OA office automation manuscript auditing method based on BERT
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
US20220028371A1 (en) * 2020-07-27 2022-01-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing speech dialogues
CN114611494A (en) * 2022-03-17 2022-06-10 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN114863429A (en) * 2022-03-16 2022-08-05 来也科技(北京)有限公司 Text error correction method and training method based on RPA and AI and related equipment thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220028371A1 (en) * 2020-07-27 2022-01-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing speech dialogues
CN112101010A (en) * 2020-11-23 2020-12-18 中博信息技术研究院有限公司 Telecom industry OA office automation manuscript auditing method based on BERT
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN114863429A (en) * 2022-03-16 2022-08-05 来也科技(北京)有限公司 Text error correction method and training method based on RPA and AI and related equipment thereof
CN114611494A (en) * 2022-03-17 2022-06-10 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHANGCHANG ZENG,ET AL.: "Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets", 《WIRELESS COMMUNICATIONS AND MOBILE COMPUTING》 *
GUANGLI ZHU,ET AL.: "Causality Extraction Model Based on Two-stage GCN", 《RESEARCH SPUARE》 *
刘祥龙 等著: "《飞桨PaddlePaddle深度学习实战》", 31 August 2020, 机械工业出版社 *
华果才让 等: "面向汉藏机器翻译后处理的藏文虚词纠错模型", 《计算机仿真》 *
朱晨光 编著: "《机器阅读理解:算法与实践》", 30 April 2020, 机械工业出版社 *
高扬 著: "《智能摘要与深度学习》", 30 April 2019, 北京理工大学出版社 *

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN110196894B (en) Language model training method and language model prediction method
CN108595410B (en) Automatic correction method and device for handwritten composition
US7917350B2 (en) Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN107273358B (en) End-to-end English chapter structure automatic analysis method based on pipeline mode
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
Mabona et al. Neural generative rhetorical structure parsing
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
Jemni et al. Out of vocabulary word detection and recovery in Arabic handwritten text recognition
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN114863429A (en) Text error correction method and training method based on RPA and AI and related equipment thereof
CN114564912B (en) Intelligent document format checking and correcting method and system
Schaback et al. Multi-level feature extraction for spelling correction
CN112214994B (en) Word segmentation method, device and equipment based on multi-level dictionary and readable storage medium
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
CN116956946B (en) Machine translation text fine granularity error type identification and positioning method
CN114970554B (en) Document checking method based on natural language processing
CN115204151A (en) Chinese text error correction method, system and readable storage medium
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium
CN115310432A (en) Wrongly written character detection and correction method
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114896973A (en) Text processing method and device and electronic equipment
Mohapatra et al. Spell checker for OCR
Shitaoka et al. Dependency structure analysis and sentence boundary detection in spontaneous Japanese

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221018