WO2023173533A1 - Text error correction method and apparatus, device, and storage medium - Google Patents

Text error correction method and apparatus, device, and storage medium

Info

Publication number
WO2023173533A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
error correction
model
string
text information
Prior art date
Application number
PCT/CN2022/089175
Other languages
French (fr)
Chinese (zh)
Inventor
姜鹏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023173533A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present application relates to the field of language processing technology, and in particular to a text error correction method, device, equipment and storage medium.
  • Text error correction refers to the use of machine learning and natural language processing technology to automatically correct text information.
  • Text error correction models used in the prior art are divided into large-volume models and small-volume models. Large-volume models have high memory requirements and introduce time delays in the actual error correction process; small-volume models output only the final correction result and give neither the specific error location nor the error type.
  • the inventor realized that the existing text error correction technology had technical problems in that it was unable to provide specific error locations and error types during the error correction process, and could not visually display the error correction content.
  • the main purpose of this application is to provide a text error correction method, device, equipment and storage medium to solve the problem that existing error correction solutions cannot provide specific error locations and error types during the error correction process and cannot visually display the error correction content.
  • a first aspect of this application provides a text error correction method.
  • the text error correction method includes: obtaining text data to be corrected, and preprocessing the text data to be corrected to obtain text information; inputting the text information into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information, wherein the text error correction model is a sequence-to-sequence model with a hybrid architecture, the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture; calculating, according to the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result; and mapping the characters contained in the text information and the characters contained in the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
  • a second aspect of this application provides a text error correction device, including: a preprocessing module for obtaining text data to be corrected and preprocessing the text data to be corrected to obtain text information; a text error correction processing module for inputting the text information into a pre-trained text error correction model for text error correction processing to obtain a text error correction result corresponding to the text information; a minimum edit distance calculation module for calculating the minimum edit distance between the text information and the corresponding text error correction result; and a mapping processing module for mapping the text information and the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
  • a third aspect of the present application provides a computer device, including: a memory and at least one processor, with instructions stored in the memory; the at least one processor calls the instructions in the memory, so that the computer device performs the following steps: obtaining text data to be corrected, and preprocessing the text data to be corrected to obtain text information; inputting the text information into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information, wherein the text error correction model is a sequence-to-sequence model with a hybrid architecture, the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture; calculating, according to the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result; and mapping the characters contained in the text information and the characters contained in the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
  • a fourth aspect of the present application provides a computer-readable storage medium. Instructions are stored in the computer-readable storage medium and, when run on a computer, cause the computer to perform the following steps: obtaining text data to be corrected, and preprocessing the text data to be corrected to obtain text information; inputting the text information into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information, wherein the text error correction model is a sequence-to-sequence model with a hybrid architecture, the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture; calculating, according to the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result; and mapping the characters contained in the text information and the characters contained in the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
  • the method preprocesses the text data to be corrected to obtain text information, and then inputs the text information into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information.
  • according to the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated, and these characters are mapped according to the minimum edit distance to obtain text error correction opinions.
  • the text error correction opinions obtained from the minimum edit distance reflect the relationship between the wrong content and the correct content and give the position of the wrong content in the text, so that users can make real-time adjustments. This solves the problem that existing error correction solutions cannot provide specific error locations and error types during the error correction process and cannot visually display the error correction content.
  • Figure 1 is a schematic diagram of a first embodiment of a text error correction method in the embodiment of the present application
  • Figure 2 is a schematic diagram of the second embodiment of the text error correction method in the embodiment of the present application.
  • Figure 3 is a schematic diagram of the third embodiment of the text error correction method in the embodiment of the present application.
  • Figure 4 is a schematic diagram of an embodiment of the text error correction device in the embodiment of the present application.
  • Figure 5 is a schematic diagram of another embodiment of the text error correction device in the embodiment of the present application.
  • Figure 6 is a schematic diagram of an embodiment of a computer device in an embodiment of the present application.
  • this application provides a text error correction method.
  • This method preprocesses the text data to be corrected to obtain text information, and then inputs the text information into the pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information; according to the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated, and these characters are mapped according to the minimum edit distance to obtain text error correction opinions.
  • the text error correction opinions obtained from the minimum edit distance reflect the relationship between the wrong content and the correct content and give the location of the wrong content in the text, so that the user can make real-time adjustments. This solves the problem that existing error correction schemes cannot provide specific error locations and error types during the error correction process and cannot visually display the error correction content.
  • the preprocessing includes operations such as cleaning and classification, specifically:
  • the text data after data cleaning is classified according to the preset text categories to obtain different categories of text information.
  • the text error correction model is a sequence-to-sequence model with a hybrid architecture.
  • the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture.
  • the text information is input into the pre-trained text error correction model, and the text information is encoded by an encoder using the Transformer model architecture to obtain text encoding;
  • the attention values are spliced and combined according to the preset method in the text error correction model to obtain a set of attention values
  • the perplexity value calculation algorithm is called through the long short-term memory model, and the attention values contained in the attention value set are iteratively calculated to obtain the corresponding perplexity;
  • the text encoding is decoded based on the probability prediction result to obtain a text error correction result.
  • linear transformation and projection processing are performed on the text encoding according to the attention mechanism, and the attention value corresponding to the text encoding is calculated, including:
  • a residual connection is applied to the text encoding through the sub-layer in the encoder, the input text encoding is linearly superposed with the result of the nonlinear transformation, and the processing results are normalized;
  • a multi-head self-attention mechanism is used to linearly transform the text encoding and project to different dimensions under the attention mechanism. Specifically, the following formula is used for processing:
  • Zl is a simple linear layer operation
  • Lin is a linear combination, used for linear transformation of encoding
  • Y1 is the parameter corresponding to the text encoding
  • T is the corresponding modified word vector
  • Xl is the predicted word currently output
  • Cl participates in the operation to determine the information output to the next convolutional layer.
  • ET is the transpose of the text encoding corresponding to the sentence to be corrected in the hidden layer of the encoder.
  • E is the encoder output for the sentence to be corrected.
  • S is the input word vector.
  • SoftMax is the activation layer. In the above formula, it means that the operation in parentheses is implemented through the activation layer.
  • the attention mechanism can also be used to process the text encoding.
  • the attention values are spliced and combined according to the preset method in the text error correction model to obtain a set of attention values, including:
  • the encoding vector is projected onto the preset dimensions Q, K, V of the multi-head attention mechanism, and the resulting attention values are spliced together to obtain the attention value set, expressed as:
  • $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$
  • $\mathrm{head}_1, \ldots, \mathrm{head}_h$ represent each attention head (head) in the multi-head attention mechanism
  • $W^O$ represents the preset parameters for converting the projection results of each attention head (head).
  • the perplexity value calculation algorithm is called through the long short-term memory model, and the attention values contained in the attention value set are iteratively calculated to obtain the corresponding perplexity, including calculating the perplexity in the following manner:
  • $PP(W) = P(\omega_1 \omega_2 \ldots \omega_N)^{-1/N}$
  • PP(W) represents the perplexity value of sentence W.
  • $\omega_1$, $\omega_2$ and $\omega_N$ all represent the attention values of the word vectors corresponding to sentence W.
  • the subscript N of $\omega_N$ represents the number of attention values selected for the current iteration calculation.
  • $P(\omega_1 \omega_2 \ldots \omega_N)$ represents the sentence probability calculated from the attention values.
  • the perplexity value is obtained by iteratively calculating all attention values using the calculation method represented by the formula.
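The perplexity definition above can be illustrated with a minimal sketch. It assumes that the sentence probability factorizes into per-token probabilities, which the source does not state; the function name `perplexity` and the input `token_probs` are hypothetical.

```python
import math


def perplexity(token_probs):
    """Return PP(W) = P(w_1 ... w_N) ** (-1/N).

    Assumption: the sentence probability is the product of the
    per-token probabilities passed in `token_probs`.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Sum log-probabilities instead of multiplying to avoid underflow.
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / len(token_probs))
```

Under this assumption, a sentence whose tokens each have probability 0.25 has a perplexity of about 4; lower values mean the model finds the sentence more predictable.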
  • probability prediction is performed on text encoding according to the perplexity, and a probability prediction result is obtained, including:
  • if the perplexity of each sentence in the text to be corrected is less than the preset perplexity threshold, then it is determined that each sentence in the text to be corrected is a sentence that does not require error correction;
  • if the perplexity of a sentence in the text to be corrected is not less than the preset perplexity threshold, the sentence is determined to be a sentence that needs to be corrected;
  • probability prediction is performed on the text encoding through the long short-term memory model to obtain a probability prediction result.
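The threshold decision described above could be sketched as follows. The function name and arguments are hypothetical, and treating a perplexity equal to the threshold as "needs correction" is an assumption, since the source only states the "less than" case explicitly.

```python
def sentences_needing_correction(sentences, perplexities, threshold):
    """Return the sentences whose perplexity is not below the threshold.

    Assumption: perplexity >= threshold means the sentence is sent on
    to the decoder for correction; lower values are left unchanged.
    """
    if len(sentences) != len(perplexities):
        raise ValueError("one perplexity value is required per sentence")
    return [s for s, pp in zip(sentences, perplexities) if pp >= threshold]
```

Only the sentences returned here would then go through probability prediction and decoding.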
  • the pre-trained text error correction model is trained in the following way:
  • the initial training model is quantized to obtain a text error correction model.
  • factoring the parameters of the embedding layer includes:
  • the hybrid architecture model is learned and trained based on the training data set to obtain an initial training model, including:
  • Classify text data with error correction information in a preset manner; for example, classify text data by language into Chinese, English and special symbols;
  • Split and combine the text data with error correction information in a preset way to build a training data set; for example, split the text information into sentences and combine the error correction information with the original text according to their correspondence to obtain the training data set;
  • the hybrid architecture model is learned and trained based on the training data set to obtain an initial training model.
  • the text error correction model is obtained by performing quantization processing on the initial training model, including:
  • Quantize the initial training model through a preset quantization algorithm, such as Deep Compression, Binary-Net, Ternary-Net or DoReFa-Net;
  • each 32-bit floating point weight can be stored using 8 bits. Although this means that each weight is stored with lower fidelity, the quality of the model is not significantly affected.
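To make the 32-bit to 8-bit idea concrete, here is a minimal sketch of plain linear (affine) quantization. It is not an implementation of any of the named algorithms; the function names and the min/max scaling scheme are assumptions chosen for illustration.

```python
def quantize_uint8(weights):
    """Map float weights to integer codes in [0, 255] plus (scale, offset)."""
    lo, hi = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights are equal (hi == lo).
    scale = (hi - lo) / 255 or 1.0
    codes = [min(255, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_uint8(codes, scale, lo):
    """Recover approximate float weights from the 8-bit codes."""
    return [lo + c * scale for c in codes]
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the "lower fidelity" trade-off mentioned above.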
  • according to the minimum edit distance algorithm, the minimum edit distance between the string contained in the text information and the string contained in the text error correction result is calculated
  • This step includes:
  • the string is converted into a character matrix with the correspondence, wherein the character matrix contains the character feature values of all characters in the string;
  • an edit distance operation is performed on each character feature value in the character matrix to obtain the minimum edit distance between the characters included in the text information and the characters included in the corresponding text error correction result.
  • splitting the character set according to a preset splitting method to form a string includes:
  • the character set is split according to the correspondence between characters and sentences in the text to obtain a string, in which the characters contained in a string belong to the same sentence.
  • the character set may also be split according to phrases formed based on specific grammar.
  • the character set may be split into phrases containing at least one verb and at least one noun.
  • the dynamic programming equation is constructed according to the preset editing operation type, including:
  • the dynamic programming equation can be constructed in the following way:
  • edit[i][j] represents the edit distance between the first i characters of string A (word1) and the first j characters of string B (word2); string subscripts start from 1;
  • edit[0][0] means that when both word1 and word2 are empty, the edit distance between them is 0. edit[0][j] is the case where word1 is empty and the length of word2 is j; their edit distance is j, because converting the empty string into word2 requires adding j characters. Similarly, edit[i][0] is the case where the length of word1 is i and word2 is empty; word1 needs i deletions to become empty, so the minimum edit distance is i;
  • the dynamic programming equation is constructed as follows:
  • $\mathrm{edit}[i][j] = \min\big(\mathrm{edit}[i-1][j] + 1,\ \mathrm{edit}[i][j-1] + 1,\ \mathrm{edit}[i-1][j-1] + \mathrm{flag}\big)$
  • edit[i-1][j]+1 is equivalent to inserting the last character of word1 at the end of word2: the insertion operation adds 1 to the edit distance, and edit[i-1][j] is then calculated;
  • edit[i][j-1]+1 is equivalent to deleting the last character of word2: the deletion operation adds 1 to the edit distance, and edit[i][j-1] is then calculated;
  • edit[i-1][j-1]+flag is equivalent to replacing the last character of word2 with the last character of word1; flag is the number of substitutions actually needed, which is 0 when the two characters are the same and 1 otherwise.
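The recurrence and boundary conditions above correspond to the classic Levenshtein table. A minimal sketch follows; the function name `min_edit_distance` is hypothetical, while `edit`, `word1`, `word2` and `flag` follow the names used in the text.

```python
def min_edit_distance(word1, word2):
    """Fill the edit[i][j] table and return the distance between the words."""
    m, n = len(word1), len(word2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary rows: converting to or from the empty string.
    for i in range(m + 1):
        edit[i][0] = i
    for j in range(n + 1):
        edit[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # flag is 0 when the two characters match, 1 otherwise.
            flag = 0 if word1[i - 1] == word2[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,         # one-character step on word1
                edit[i][j - 1] + 1,         # one-character step on word2
                edit[i - 1][j - 1] + flag,  # keep or replace the character
            )
    return edit[m][n]
```

For example, `min_edit_distance("kitten", "sitting")` is 3: two replacements and one insertion.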
  • This step specifically includes:
  • according to the preset editing operation type and the minimum editing distance between the strings in each corresponding string group, and following the editing direction from the string in the text information to the string in the text error correction result, the minimum editing distance is converted into an editing operation sequence, where the editing operation sequence includes the editing position, editing operation type and editing order involved in editing the characters in the string.
  • for example, when the preset editing operation types are deleting a character, inserting a character, and modifying a character, the minimum editing distance is converted into an editing operation sequence that records the editing position, editing operation type, and editing order of those operations.
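One way to turn the distance table into such an editing operation sequence is to backtrace through it. The sketch below is an illustration under assumptions: the tuple layout `(type, position, old_char, new_char)` and the function name are hypothetical, and positions are 0-based indices into the original string.

```python
def edit_operations(source, target):
    """Return the edits that turn `source` into `target`, in left-to-right order."""
    m, n = len(source), len(target)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i
    for j in range(n + 1):
        edit[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if source[i - 1] == target[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1, edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + flag)
    ops, i, j = [], m, n
    # Walk back from the bottom-right cell, recording each non-match step.
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and edit[i][j] == edit[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, nothing to record
        elif i > 0 and j > 0 and edit[i][j] == edit[i - 1][j - 1] + 1:
            ops.append(("modify", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and edit[i][j] == edit[i - 1][j] + 1:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
        else:
            # Insert target[j-1] before source position i.
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

The length of the returned list equals the minimum edit distance, and each entry carries the position and type needed to point the user at a specific error.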
  • the text information is obtained and then input into the pre-trained text error correction model for text error correction processing to obtain the text error correction results corresponding to the text information; according to the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated; these characters are mapped according to the minimum edit distance to obtain text error correction opinions. The text error correction opinions obtained from the minimum edit distance reflect the relationship between the incorrect content and the correct content and give the location of the incorrect content in the text.
  • the noise in the text data is removed to obtain noise-free text data; the text form of the noise-free text data is converted into a preset text form to obtain converted text data; the converted text data is classified and filtered according to preset categories and characteristics to obtain text information.
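The three preprocessing stages above (noise removal, conversion to a preset text form, classification) might look like the following sketch. What counts as noise, the choice of Unicode NFKC as the preset form, and the keyword-based categories are all assumptions; the source does not specify them.

```python
import re
import unicodedata


def preprocess(raw_text, categories):
    """Clean one text and return (category_label, cleaned_text).

    `categories` maps a label to a list of keywords (hypothetical scheme).
    """
    # Noise removal (assumed): drop HTML-like tags.
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # Preset text form (assumed): NFKC folds full-width characters.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace left behind by the steps above.
    text = re.sub(r"\s+", " ", text).strip()
    # Classification (assumed): first category with a matching keyword.
    for label, keywords in categories.items():
        if any(keyword in text for keyword in keywords):
            return label, text
    return "other", text
```

A real system would likely replace the keyword lookup with a trained classifier; the sketch only shows how the stages chain together.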
  • this step includes:
  • compression encoding can be used for encoding processing.
  • Compression encoding is an encoding method that can reduce the data size, such as BPE encoding (byte pair encoding).
  • BPE encoding, also called digram coding, is mainly used for data compression.
  • the BPE encoding method is an iterative process of replacing the most common pair of characters in a string with a character that does not appear in that string. For example, when the word in the initial text is "student", the character "A" can be used to replace the characters "stu", and the character "B" can be used to replace "dent", so the word "student" can be encoded as "AB".
  • compression encoding can be performed in units of words, phrases, sentences, etc., to obtain the text encoding corresponding to the entire text information.
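A single iteration of the pair-replacement process described above can be sketched as below. As an assumption for readability, the merged pair becomes one concatenated symbol rather than a fresh unused character such as "A"; the function name is hypothetical.

```python
from collections import Counter


def bpe_merge_step(symbols):
    """Merge the most frequent adjacent pair; return (new_symbols, pair)."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return list(symbols), None
    (left, right), _count = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            merged.append(left + right)  # the pair becomes one symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged, (left, right)
```

Calling the function repeatedly on its own output, and recording each returned pair, yields the merge table used to encode words, phrases or sentences.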
  • the encoding method based on the attention mechanism can also be used. Specifically, the feature information in the text information is extracted and converted into a feature vector, and the feature vector is then encoded based on the attention mechanism to obtain the text encoding corresponding to the entire text information.
  • this step includes:
  • the error correction vocabulary contains the text encodings of sentences with errors and the text encodings of the corresponding corrected text;
  • the text encoding is decoded according to the decoding rules of the long short-term memory model to obtain the text error correction result.
  • according to the minimum edit distance algorithm, the minimum edit distance between the sentences contained in the text information and the sentences contained in the text error correction results is calculated
  • this step includes:
  • the text information and text error correction results are obtained separately, split by sentence, and converted into string sets.
  • the string sets comprise one set corresponding to the sentences contained in the text information and one set corresponding to the sentences contained in the text error correction results.
  • a dynamic programming equation is constructed, and the edit distance operation is performed on the string sets to obtain the minimum edit distance between strings.
  • the minimum edit distance represents the minimum edit distance between the sentences contained in the text information and the sentences contained in the text error correction results.
  • this step includes:
  • Convert the minimum editing distance into an editing operation sequence according to a preset editing operation type. For example, when the preset editing operation types are deleting a character, inserting a character, and modifying a character, convert the minimum editing distance into an editing operation sequence composed of those three editing operations;
  • the feature information (Source) in the text information is extracted and converted into a feature vector, and represented through the attention mechanism as data pairs <Key, Value>, each containing an address (Key) and a value (Value). Given a query element (Query) in the target (Target), the similarity between the Query and each Key is calculated to obtain the weight coefficient of the Value corresponding to that Key, and the Values are then weighted and summed to obtain the Attention value. In practical applications, the following formula is used to perform the weighted summation over the Values of the elements in the Source, with Query and Key used to calculate the weight coefficient of the corresponding Value:
  • $\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i$
  • Attention represents the Attention value that needs to be calculated when the feature information (Source) is converted into Query elements in the target (Target) through the Attention mechanism
  • Similarity represents the correlation between Query and each Key
  • $L_x$ represents the length of Source, and the subscript i of Key and Value is the index of the data pair <Key, Value> in the weighted summation.
  • encoding processing based on the attention mechanism is performed to obtain the text encoding corresponding to the entire text information.
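The weighted summation over <Key, Value> pairs can be sketched in a few lines. The source does not say how Similarity is computed or normalized, so the dot product and softmax used here are assumptions, as is the function name.

```python
import math


def attention(query, keys, values):
    """Weighted sum of `values`, weighted by similarity of `query` to each key."""
    # Similarity (assumed): dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Weight coefficients (assumed): softmax, shifted by the max for stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    # Attention value: sum over i of weight_i * Value_i, per dimension.
    return [sum(w * value[d] for w, value in zip(weights, values))
            for d in range(dim)]
```

When two keys are equally similar to the query, their values are simply averaged, which is a quick sanity check on any implementation.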
  • the perplexity value calculation algorithm is called through the long short-term memory model to calculate the perplexity value corresponding to the text encoding, including using the following formula to calculate the perplexity value (perplexity):
  • $PP(W) = P(\omega_1 \omega_2 \ldots \omega_N)^{-1/N}$
  • PP(W) represents the perplexity value of sentence W.
  • $\omega_1$, $\omega_2$ and $\omega_N$ all represent the text codes corresponding to the words contained in sentence W.
  • the subscript N of $\omega_N$ represents the range of text codes corresponding to the words selected in the current iterative calculation.
  • $P(\omega_1 \omega_2 \ldots \omega_N)$ represents the probability of the sentence.
  • if the perplexity of each sentence in the text to be corrected is less than the preset perplexity threshold, then it is determined that each sentence in the text to be corrected is a sentence that does not require error correction;
  • if the perplexity of a sentence in the text to be corrected is not less than the preset perplexity threshold, the sentence is determined to be a sentence that needs to be corrected;
  • the text encoding is probabilistically predicted through the long short-term memory model and replaced with the predicted text encoding to obtain a probabilistic prediction result;
  • the predicted text encoding is decoded into text form to obtain text error correction results.
  • the minimum edit distance algorithm calculate the minimum edit distance between the sentences contained in the text information and the sentences contained in the text error correction results
  • this step includes:
  • the text information and text error correction results are obtained separately, split according to sentences, and converted into a string set.
  • the string set includes a string set corresponding to the sentences contained in the text information and a string set corresponding to the sentences contained in the text error correction results.
  • a dynamic programming equation is constructed, and the edit distance operation is performed on the string set to obtain the minimum edit distance between strings.
  • the minimum edit distance represents the minimum edit distance between the sentences contained in the text information and the sentences contained in the text error correction results.
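The dynamic programming step described above can be sketched as the classic Levenshtein recurrence. This is an illustrative sketch, not the implementation of the application: it assumes unit cost for deleting, inserting, and modifying a character, and it assumes that the sentences of the original text and of the corrected text are already aligned one to one. The function names are hypothetical.

```python
def min_edit_distance(source, target):
    """Minimum number of character insertions, deletions, and modifications."""
    rows, cols = len(source) + 1, len(target) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        dp[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete a character
                dp[i][j - 1] + 1,         # insert a character
                dp[i - 1][j - 1] + cost,  # keep or modify a character
            )
    return dp[-1][-1]


def sentence_distances(original_sentences, corrected_sentences):
    """Distance for each aligned (original, corrected) sentence pair."""
    return [
        min_edit_distance(a, b)
        for a, b in zip(original_sentences, corrected_sentences)
    ]


print(sentence_distances(["I has a pen."], ["I have a pen."]))
```

A distance of zero for a sentence pair indicates that the model left the sentence unchanged, so only pairs with a non-zero distance need to be reported to the user.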
  • this step includes:
  • Convert the minimum edit distance into an editing operation sequence according to the preset editing operation types. For example, when the preset editing operation types are deleting a character, inserting a character, and modifying a character, the minimum edit distance is converted into an editing operation sequence made up of delete, insert, and modify operations;
  • Output the text information and the corresponding editing operation sequence according to a preset output method to obtain text error correction opinions. For example, the editing operation sequence can be output interactively so that the user can selectively apply corrections. Specifically, the presentation can be based on the text to be corrected: the content to be corrected is displayed in a different color or font on the user interface, and the corresponding editing operation sequence information is output in the form of links or arrows to obtain text error correction opinions;
  • Alternatively, the presentation can be based on the text error correction result: the areas that differ from the text to be corrected are highlighted on the user interface, and the corresponding editing operation sequence information is output in the form of links or arrows to obtain text error correction opinions.
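The conversion from a minimum edit distance to an editing operation sequence can be illustrated by backtracking through the dynamic programming table. The sketch below is one possible realization under stated assumptions: unit costs, the three operation types named above, and a hypothetical tuple layout of (operation, position in the original text, original character, new character).

```python
def edit_operations(source, target):
    """Return one minimal sequence of delete/insert/modify operations."""
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)

    # Walk back from the bottom-right cell, recording each edit.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dp[i][j] == dp[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, nothing to record
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("modify", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


for op in edit_operations("has", "have"):
    print(op)
```

Because each operation carries the position in the original text, a user interface could use that position to highlight the affected character and attach the suggested change to it.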
  • the text information is obtained and input into the pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information; the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated according to the minimum edit distance algorithm; the characters contained in the text information and the characters contained in the corresponding text error correction result are mapped according to the minimum edit distance to obtain text error correction opinions; obtaining the text error correction opinions by calculating the minimum edit distance reflects the relationship between the incorrect content and the correct content and gives the location of the incorrect content in the text.
  • the text error correction method in the embodiment of the present application is described above, and the text error correction device in the embodiment of the present application is described below. Please refer to Figure 4.
  • An example of the text error correction device in the embodiment of the present application includes:
  • the preprocessing module 401 is used to obtain the data to be corrected and preprocess the data to be corrected to obtain text information;
  • the text error correction processing module 402 is used to input the text information into a pre-trained text error correction model for text error correction processing, and obtain a text error correction result corresponding to the text information;
  • the minimum edit distance calculation module 403 is used to calculate the minimum edit distance between the string contained in the text information and the string contained in the text error correction result according to the minimum edit distance algorithm;
  • the mapping processing module 404 is configured to perform mapping processing on the string contained in the text information and the string contained in the text error correction result according to the minimum edit distance to obtain text error correction opinions.
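The way the four modules listed above hand data to one another can be sketched as a small pipeline. This is only an architectural illustration: the class name, the callable signatures, and the stand-in corrector (a simple string replacement instead of a trained model) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Row-by-row minimum edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


@dataclass
class TextCorrectionDevice:
    preprocess: Callable[[str], str]                    # role of module 401
    correct: Callable[[str], str]                       # role of module 402
    edit_distance: Callable[[str, str], int]            # role of module 403
    map_opinions: Callable[[str, str, int], List[str]]  # role of module 404

    def run(self, raw: str) -> Tuple[str, List[str]]:
        text = self.preprocess(raw)
        corrected = self.correct(text)
        distance = self.edit_distance(text, corrected)
        return corrected, self.map_opinions(text, corrected, distance)


# Stand-in components; a real system would plug in the trained model here.
device = TextCorrectionDevice(
    preprocess=str.strip,
    correct=lambda t: t.replace("teh", "the"),
    edit_distance=levenshtein,
    map_opinions=lambda o, c, d: [f"{d} edit(s): '{o}' -> '{c}'"] if d else [],
)

print(device.run("  teh cat "))
```

Keeping each module behind a plain callable makes it possible to replace the stand-in corrector with a model call without changing the rest of the pipeline.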
  • Another embodiment of the text error correction device in the embodiment of the present application includes:
  • the preprocessing module 401 is used to obtain the data to be corrected and preprocess the data to be corrected to obtain text information;
  • the text error correction module 402 is used to determine whether the length of the text information is greater than a preset length threshold;
  • the minimum edit distance calculation module 403 is used to call the preset text summary extraction algorithm to streamline the text information to obtain summary data of the text information when the length is judged to be greater than the preset length threshold;
  • the mapping processing module 404 is used to input the summary data into the pre-trained text error correction model for text error correction processing to obtain the emotional information in the data to be corrected;
  • the model training module 405 is used to extract the encoder in the Transformer model framework and the decoder in the long short-term memory model framework; share embedding layer parameters between the encoder and the decoder and factorize the embedding layer parameters to construct a hybrid architecture model; construct a training data set from text data with error correction information, and train the hybrid architecture model on the training data set to obtain an initial training model; and quantize the initial training model to obtain the text error correction model.
  • the text error correction module 402 includes:
  • the text encoding unit 4021 is used to encode the text information through an encoder using the Transformer model architecture to obtain the text encoding;
  • the first calculation unit 4022 is used to perform linear transformation and projection processing on the text encoding according to the attention mechanism, and calculate the attention value corresponding to the text encoding;
  • the second calculation unit 4023 is used to call the perplexity value calculation algorithm, iteratively calculate the attention values contained in the attention value set, and obtain the corresponding perplexity;
  • the probability prediction unit 4024 is used to perform probability prediction on the text encoding according to the perplexity to obtain a probability prediction result;
  • the text decoding unit 4025 is used to decode the text encoding according to the probability prediction result to obtain the text error correction result;
  • the minimum edit distance calculation module 403 includes:
  • the character conversion unit 4031 is used to extract all characters in the text information and in the corresponding text error correction result to form a character set, split the character set according to a preset splitting method to form strings, and, according to the correspondence between the text information and the text error correction result, convert the strings into a character matrix with that correspondence;
  • the dynamic programming unit 4032 is used to construct dynamic programming equations according to the preset editing operation type;
  • the third calculation unit 4033 is used to perform an edit distance operation on each character feature value in the character matrix to obtain the minimum edit distance between the string contained in the text information and the string contained in the text error correction result;
  • mapping processing module 404 includes:
  • the mapping unit 4041 is configured to perform mapping processing on the string contained in the text information and the string contained in the text error correction result according to the minimum edit distance, to obtain corresponding string groups;
  • the sequence generation unit 4042 is configured to construct the editing operation sequence according to the preset editing operation type and the minimum edit distance between the strings in each corresponding string group, in the direction of editing the string in the text information into the corresponding string in the text error correction result;
  • the opinion output unit 4043 is used to output the text information and the editing operation sequence of the strings it contains according to the preset output mode, to obtain text error correction opinions;
  • model training module 405 includes:
  • the training data set generation unit 4051 is used to collect text data and construct a training data set according to a preset method
  • the training unit 4052 is used to cyclically input the training data set into the hybrid architecture model through a hard distillation loop, obtain the corresponding training results through the encoding and decoding operation of the model to be trained, and determine whether the training results meet the preset condition, if yes, terminate the loop and output the initial training model.
  • Referring to FIG. 6, an embodiment of the computer device in the embodiment of the present application is described in detail below from the perspective of hardware processing.
  • FIG. 6 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 600 may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPU) 610, memory 620, and one or more storage media 630 (e.g., one or more mass storage devices) storing applications 633 or data 632.
  • the memory 620 and the storage medium 630 may be short-term storage or persistent storage.
  • the program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the computer device 600 .
  • the processor 610 may be configured to communicate with the storage medium 630 and execute a series of instruction operations in the storage medium 630 on the computer device 600 .
  • the computer device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, and FreeBSD.
  • the computer-readable storage medium can be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium can also be a volatile computer-readable storage medium.
  • the text error correction model is a sequence-to-sequence model with a hybrid architecture, in which the encoder part adopts the Transformer model architecture and the decoder part adopts the long short-term memory model architecture; according to the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated; the characters contained in the text information and the characters contained in the corresponding text error correction result are mapped according to the minimum edit distance to obtain text error correction opinions.
  • Artificial intelligence is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. Specifically, the method can be executed on a server.
  • the server can be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms.
  • Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products.
  • the technical solution of the present application, in essence or in the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of language processing. Disclosed are a text error correction method and apparatus, a device, and a storage medium. The method comprises: preprocessing text data to be error-corrected to obtain text information; then inputting the text information into a pre-trained text error correction model for text error correction processing to obtain a text error correction result corresponding to the text information; calculating, according to a minimum edit distance algorithm, a minimum edit distance between characters comprised in the text information and characters comprised in the text error correction result corresponding to the text information; and performing mapping processing on the characters comprised in the text information and the characters comprised in the text error correction result corresponding to the text information according to the minimum edit distance to obtain a text error correction opinion. The text error correction opinion is obtained by calculating the minimum edit distance, so as to reflect a relationship between incorrect content and correct content, and the position of the incorrect content in a text is provided, so that a user can perform an adjustment in real time.

Description

Text error correction method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 17, 2022, with application number 202210262506.4 and the invention title "Text Error Correction Method, Apparatus, Device, and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical Field
The present application relates to the field of language processing technology, and in particular to a text error correction method, apparatus, device, and storage medium.
Background
Text error correction refers to the use of machine learning and natural language processing technology to automatically correct errors in text information. The text error correction models used in the prior art are divided into large models and small models. Large models are usually designed to be large, have high memory requirements, and introduce latency in the actual error correction process; small models can only output the final correction result and do not give the specific error location or error type.
In this regard, the inventor realized that existing text error correction technology has the technical problem that it cannot give the specific error location and error type during the error correction process and cannot display the error correction content intuitively.
Summary of the Invention
The main purpose of the present application is to provide a text error correction method, apparatus, device, and storage medium, so as to solve the problem that existing error correction solutions cannot give the specific error location and error type during the error correction process and cannot display the error correction content intuitively.
A first aspect of the present application provides a text error correction method, the method comprising: obtaining text data to be corrected and preprocessing the text data to be corrected to obtain text information; inputting the text information into a pre-trained text error correction model for text error correction processing to obtain a text error correction result corresponding to the text information, wherein the text error correction model is a sequence-to-sequence model with a hybrid architecture, the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture; calculating, according to a minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result; and mapping the characters contained in the text information and the characters contained in the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
A second aspect of the present application provides a text error correction apparatus, comprising: a preprocessing module, used to obtain text data to be corrected and preprocess the text data to be corrected to obtain text information; a text error correction processing module, used to input the text information into a pre-trained text error correction model for text error correction processing to obtain a text error correction result corresponding to the text information; a minimum edit distance calculation module, used to calculate the minimum edit distance between the text information and the corresponding text error correction result; and a mapping processing module, used to map the text information and the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
A third aspect of the present application provides a computer device, comprising a memory and at least one processor, the memory storing instructions; the at least one processor calls the instructions in the memory so that the computer device performs the following steps: obtaining text data to be corrected and preprocessing the text data to be corrected to obtain text information; inputting the text information into a pre-trained text error correction model for text error correction processing to obtain a text error correction result corresponding to the text information, wherein the text error correction model is a sequence-to-sequence model with a hybrid architecture, the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture; calculating, according to a minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result; and mapping the characters contained in the text information and the characters contained in the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
A fourth aspect of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the following steps: obtaining text data to be corrected and preprocessing the text data to be corrected to obtain text information; inputting the text information into a pre-trained text error correction model for text error correction processing to obtain a text error correction result corresponding to the text information, wherein the text error correction model is a sequence-to-sequence model with a hybrid architecture, the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture; calculating, according to a minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result; and mapping the characters contained in the text information and the characters contained in the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
In the technical solution of the present application, the text data to be corrected is preprocessed to obtain text information, which is input into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information; the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated according to the minimum edit distance algorithm; and the characters contained in the text information and the characters contained in the corresponding text error correction result are mapped according to the minimum edit distance to obtain text error correction opinions. Obtaining the text error correction opinions by calculating the minimum edit distance reflects the relationship between the incorrect content and the correct content and gives the location of the incorrect content in the text, so that the user can make adjustments in real time. This solves the problem that existing error correction solutions cannot give the specific error location and error type during the error correction process and cannot display the error correction content intuitively.
Description of the Drawings
Figure 1 is a schematic diagram of a first embodiment of the text error correction method in the embodiments of the present application;
Figure 2 is a schematic diagram of a second embodiment of the text error correction method in the embodiments of the present application;
Figure 3 is a schematic diagram of a third embodiment of the text error correction method in the embodiments of the present application;
Figure 4 is a schematic diagram of an embodiment of the text error correction apparatus in the embodiments of the present application;
Figure 5 is a schematic diagram of another embodiment of the text error correction apparatus in the embodiments of the present application;
Figure 6 is a schematic diagram of an embodiment of the computer device in the embodiments of the present application.
Detailed Description
To solve the problem that existing error correction solutions cannot give the specific error location and error type during the error correction process and cannot display the error correction content intuitively, the present application provides a text error correction method. In this method, the text data to be corrected is preprocessed to obtain text information, which is input into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information; the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated according to the minimum edit distance algorithm; and the characters contained in the text information and the characters contained in the corresponding text error correction result are mapped according to the minimum edit distance to obtain text error correction opinions. Obtaining the text error correction opinions by calculating the minimum edit distance reflects the relationship between the incorrect content and the correct content and gives the location of the incorrect content in the text, so that the user can make adjustments in real time.
The terms "first", "second", "third", "fourth", and so on (if present) in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to Figure 1, the first embodiment of the text error correction method in the embodiments of the present application is implemented by the following steps:
101. Obtain the data to be corrected and preprocess the data to be corrected to obtain text information;
In this step, the preprocessing includes operations such as cleaning and classification, specifically:
performing data cleaning on the data to be corrected to obtain cleaned text data;
classifying the cleaned text data according to preset text categories to obtain text information of different categories.
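The cleaning and classification operations of step 101 can be illustrated with a minimal sketch. The application does not specify the cleaning rules or the classifier, so the rules below (Unicode normalization, removal of HTML-like tags, whitespace collapsing) and the keyword-based categories are assumptions chosen only to make the example concrete; all names are hypothetical.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize characters, drop HTML-like tags, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def classify_text(text, category_keywords, default="other"):
    """Assign the first preset category whose keyword occurs in the text."""
    lowered = text.lower()
    for category, keywords in category_keywords.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return default


def preprocess(raw_items, category_keywords):
    """Clean each item, skip empty ones, and group the rest by category."""
    grouped = {}
    for raw in raw_items:
        text = clean_text(raw)
        if not text:
            continue
        category = classify_text(text, category_keywords)
        grouped.setdefault(category, []).append(text)
    return grouped


# Hypothetical preset categories.
categories = {"finance": ["loan", "insurance"], "medical": ["doctor"]}
samples = ["<p>My  loan   application</p>", "   ", "See a doctor\n today"]
print(preprocess(samples, categories))
```

Grouping the cleaned text by category would allow later steps to handle each category separately, for example with category-specific vocabularies.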
102. Input the text information into the pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information;
The text error correction model is a sequence-to-sequence model with a hybrid architecture: the encoder part adopts the Transformer model architecture, and the decoder part adopts the long short-term memory model architecture.
In this embodiment, the step includes:
inputting the text information into the pre-trained text error correction model, and encoding the text information through the encoder that adopts the Transformer model architecture to obtain a text encoding;
performing linear transformation and projection processing on the text encoding according to the attention mechanism, and calculating the attention values corresponding to the text encoding;
splicing and combining the attention values in the manner preset in the text error correction model to obtain an attention value set;
calling the perplexity value calculation algorithm through the long short-term memory model, and iteratively calculating over the attention values contained in the attention value set to obtain the corresponding perplexity;
performing probability prediction on the text encoding through the long short-term memory model according to the perplexity to obtain a probability prediction result;
decoding the text encoding through the long short-term memory model based on the probability prediction result to obtain the text error correction result.
In this embodiment, performing linear transformation and projection processing on the text encoding according to the attention mechanism, and calculating the attention values corresponding to the text encoding, includes:
Applying residual connections to the text encoding through the sub-layers of the encoder, linearly superposing the input text encoding with its nonlinear transformation, and normalizing the result;
Applying a multi-head self-attention mechanism to linearly transform the text encoding and project it onto the different dimensions of the attention mechanism. Specifically, the following formulas are used:
$$Z_l = \mathrm{Lin}(Y_l) + T$$
$$X_l = \mathrm{SoftMax}(Z_l \cdot E^{\top}) \cdot (E + S)$$
$$C_l = \mathrm{Lin}(X_l)$$
where $Z_l$ is the result of a simple linear-layer operation; Lin is a linear combination used for the linear transformation of the encoding; $Y_l$ is the parameter corresponding to the text encoding; $T$ is the word vector of the corresponding modification; $X_l$ is the currently predicted output character; $C_l$ takes part in the computation to determine the information passed to the next convolutional layer; $E^{\top}$ is the transpose, in the encoder's hidden layer, of the text encoding corresponding to the sentence to be corrected; $E$ is the encoder output for the sentence to be corrected; $S$ is the input word vector; and SoftMax is the activation layer, which in the formulas above means that the operation inside the parentheses that follow it is carried out through the activation layer.
Of course, in practical applications, an attention mechanism (Attention) can also be used to process the text encoding.
In this embodiment, concatenating and combining the attention values in the manner preset in the text error correction model to obtain the attention value set includes:
According to the multi-head attention mechanism, projecting the encoding vector onto the preset dimensions Q, K, and V of the multi-head attention mechanism, and then concatenating the resulting attention values to obtain the attention value set, expressed as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
where $\mathrm{head}_1, \ldots, \mathrm{head}_h$ are the attention heads of the multi-head attention mechanism, and $W^{O}$ is the preset parameter that transforms the projection results of the attention heads.
In this embodiment, further, calling the perplexity calculation algorithm through the long short-term memory model and iteratively computing over the attention values in the attention value set to obtain the corresponding perplexity includes calculating the perplexity as follows:
Figure PCTCN2022089175-appb-000001
PP(W) denotes the perplexity of sentence W; ω_1, ω_2, and ω_N denote the attention values of the word vectors corresponding to sentence W, where the subscript N of ω_N is the index of the attention value selected in the current iteration; and P(ω_1 ω_2 … ω_N) denotes the sentence probability computed from the attention values. The perplexity is obtained by iterating the calculation expressed by the formula over all attention values.
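The formula image is not reproduced in this text. The following sketch assumes it is the standard definition of perplexity, PP(W) = P(ω_1 ω_2 … ω_N)^(-1/N), and further assumes that the sentence probability factorizes into per-token probabilities supplied by the model. The function name `perplexity` and its input format are illustrative only.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sentence from per-token probabilities.

    Assumes P(w1..wN) is the product of token_probs and uses
    PP(W) = P(w1..wN) ** (-1/N), computed in log space for stability.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / len(token_probs))
```

A sentence whose tokens each receive probability 0.25 has perplexity 4 under this definition. Lower values mean the model finds the sentence more plausible.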
In this embodiment, further, performing probability prediction on the text encoding through the long short-term memory model according to the perplexity, to obtain the probability prediction result, includes:
Comparing the perplexity of each sentence in the text to be corrected with a preset perplexity threshold;
If the perplexity of a sentence in the text to be corrected is less than the preset perplexity threshold, determining that the sentence does not need correction;
If the perplexity of a sentence in the text to be corrected is greater than or equal to the preset perplexity threshold, determining that the sentence needs correction;
When a sentence in the text to be corrected needs correction, performing probability prediction on the text encoding through the long short-term memory model to obtain the probability prediction result.
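The threshold comparison above amounts to a simple gate in front of the decoder. A minimal sketch follows; the function names and the example threshold are hypothetical, and the text does not say how the threshold is chosen.

```python
def needs_correction(sentence_perplexity: float, threshold: float) -> bool:
    """A sentence is flagged when its perplexity is >= the preset threshold."""
    return sentence_perplexity >= threshold


def select_sentences_to_correct(perplexities, threshold):
    """Return the indices of the sentences whose perplexity reaches the threshold."""
    return [i for i, pp in enumerate(perplexities) if needs_correction(pp, threshold)]
```

Only the selected sentences would be passed on for probability prediction; the others are left unchanged.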
In this embodiment, the pre-trained text error correction model is obtained by training as follows:
Extract the encoder from the Transformer model framework and the decoder from the long short-term memory (LSTM) model framework;
Share the embedding-layer parameters between the encoder and the decoder, and factorize the embedding-layer parameters, to construct a hybrid architecture model;
Construct a training data set from text data that carries error correction information, and train the hybrid architecture model on the training data set to obtain an initial trained model;
Quantize the initial trained model to obtain the text error correction model.
In this embodiment, factorizing the embedding-layer parameters includes:
Adding a projection layer (Project) between the embedding layer (Embedding Layer) and the hidden layer, with the projection layer connected to both the embedding layer and the hidden layer;
Reducing the dimension of the embedding layer so that it no longer has to match the dimension of the hidden layer, and factorizing the embedding-layer parameters. For example, let the vocabulary size be V, the word embedding dimension be E, and the hidden layer dimension be H. Before factorization the number of parameters is V*H, where V is typically in the tens of thousands and H is typically several hundred to several thousand. After factorization, because the word embedding dimension E is much smaller than the hidden layer dimension H, the number of parameters is much smaller than before.
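The saving can be checked with simple arithmetic. The sketch below assumes the projection layer is a single E×H matrix, so that the factorized parameter count is V*E + E*H. The text does not state this count explicitly, and the sizes V = 30000, E = 128, H = 768 are illustrative values only.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters of an unfactorized V x H embedding table."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embed_size: int, hidden_size: int) -> int:
    """Parameters of a V x E embedding table plus an assumed E x H projection."""
    return vocab_size * embed_size + embed_size * hidden_size


before = embedding_params(30000, 768)                   # 23,040,000
after = factorized_embedding_params(30000, 128, 768)    # 3,938,304
```

With these assumed sizes the embedding parameters shrink by a factor of roughly 5.9.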
In this embodiment, constructing a training data set from text data that carries error correction information, and training the hybrid architecture model on the training data set to obtain the initial trained model, includes:
Collecting text data that carries error correction information, where the text data with error correction information includes text;
Classifying the text data with error correction information in a preset manner, for example by language into Chinese, English, and special symbols;
Splitting and combining the text data with error correction information in a preset manner to construct the training data set, for example by splitting the text information into sentences and pairing the error correction information with the original text according to their correspondence;
Training the hybrid architecture model on the training data set to obtain the initial trained model.
In this embodiment, quantizing the initial trained model to obtain the text error correction model includes:
Quantizing the initial trained model with a preset quantization algorithm, for example Deep Compression, Binary-Net, Ternary-Net, or DoReFa-Net;
In practical applications, after the initial trained model is quantized, each 32-bit floating-point weight can be stored in 8 bits. Although this means each weight is stored with lower fidelity, the quality of the model is not significantly affected.
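To illustrate what storing a 32-bit weight in 8 bits means, the following sketch uses plain min-max linear quantization. It is not one of the algorithms named above, only the simplest scheme that maps floats to integer codes in the range 0 to 255; the function names are hypothetical.

```python
def quantize_uint8(weights):
    """Map floats to integer codes in [0, 255] with a shared scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_uint8(codes, scale, lo):
    """Recover approximate float weights from the 8-bit codes."""
    return [lo + c * scale for c in codes]
```

Each recovered weight differs from the original by at most half a quantization step, which is the lower fidelity the text refers to.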
103. According to the minimum edit distance algorithm, calculate the minimum edit distance between the strings contained in the text information and the strings contained in the text error correction result;
This step includes:
Extracting all characters in the text information and in the corresponding text error correction result to form a character set;
Splitting the character set according to a preset splitting method to form strings;
According to the correspondence between the text information and the text error correction result, converting the strings into a character matrix that reflects this correspondence, where the character matrix contains the character feature values of all characters in the strings;
Constructing a dynamic programming equation according to the preset editing operation types;
Performing the edit distance computation on the character feature values in the character matrix based on the dynamic programming equation, to obtain the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result.
In this embodiment, splitting the character set according to the preset splitting method to form strings includes:
Splitting the character set according to the correspondence between characters and the sentences in the text to obtain strings, where the characters contained in one string belong to the same sentence.
In practical applications, the character set can also be split by phrases formed according to a specific grammar, for example in units of phrases that contain at least one verb and at least one noun.
In this embodiment, constructing the dynamic programming equation according to the preset editing operation types includes:
When the editing operation types are set to deleting one character, inserting one character, and modifying one character, the dynamic programming equation can be constructed as follows:
Let edit[i][j] denote the edit distance between string A and string B, that is, the edit distance between the substring of A from character 0 to character i and the substring of B from character 0 to character j; string subscripts start from 1;
dis[0][0] is the case where word1 and word2 are both empty, so the edit distance between them is 0. It follows that dis[0][j] is the case where word1 is empty and word2 has length j; the edit distance is j, because converting the empty string into word2 takes a minimum of j character insertions. Similarly, dis[i][0] is the case where word1 has length i and word2 is empty; word1 must delete i characters to become empty, so the minimum edit distance to word2 is i;
Based on the above, the dynamic programming equation is constructed as follows:
Figure PCTCN2022089175-appb-000002
where:
Figure PCTCN2022089175-appb-000003
The three terms inside the min() function in the above formula correspond to three character operations:
edit[i-1][j]+1 corresponds to inserting the last character of word1 at the end of word2; the insertion adds 1 to the edit count, after which edit[i-1][j] is computed;
edit[i][j-1]+1 corresponds to deleting the last character of word2; the deletion adds 1 to the edit count, after which edit[i][j-1] is computed;
edit[i-1][j-1]+flag corresponds to replacing the last character of word2 with the last character of word1; flag is the number of substitutions actually performed.
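The formula images are not reproduced here. From the boundary cases and the three min() terms described above, the recurrence is presumably edit[i][j] = min(edit[i-1][j] + 1, edit[i][j-1] + 1, edit[i-1][j-1] + flag), with flag = 0 when the two current characters are equal and 1 otherwise. The sketch below implements that standard recurrence; the function name `min_edit_distance` is illustrative.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn source into target."""
    m, n = len(source), len(target)
    # edit[i][j]: distance between the first i chars of source and first j of target
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        edit[i][0] = i  # delete all i characters
    for j in range(1, n + 1):
        edit[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if source[i - 1] == target[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,         # one of the two unit-cost edits
                edit[i][j - 1] + 1,         # the other unit-cost edit
                edit[i - 1][j - 1] + flag,  # substitution, free when chars match
            )
    return edit[m][n]
```

For instance, `min_edit_distance("我门去公园", "我们去公园")` counts the single substitution of "门" by "们".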
104. Map the strings contained in the text information to the strings contained in the text error correction result according to the minimum edit distance, to obtain text error correction suggestions;
Specifically, this step includes:
Obtaining the strings contained in the text information and the strings contained in the text error correction result, traversing the minimum edit distances between the strings, comparing the cost values corresponding to the minimum edit distances, and selecting the string combinations with the smallest cost value to construct corresponding string groups;
According to the preset editing operation types and the minimum edit distance between the strings in each corresponding string group, and following the editing direction that turns the string in the text information into the string in the text error correction result, converting the minimum edit distance into an editing operation sequence. The editing operation sequence contains the editing positions, editing operation types, and editing order involved in editing the characters of the string. For example, when the preset editing operation types are deleting one character, inserting one character, and modifying one character, the minimum edit distance is converted into an editing operation sequence that records the editing position, operation type, and order of each deletion, insertion, and modification;
Outputting the text information together with its corresponding editing operation sequence in a preset output format to obtain the text error correction suggestions. For example, the editing operation sequence can be output interactively so that the user can apply corrections selectively. Specifically, arrows can connect the text to be corrected with the error correction result, with the two sentences mapped to each other on the basis of the minimum edits and arrows of different colors representing different editing operations. Optionally, a black arrow indicates that no processing is needed, a yellow arrow indicates that a modification is needed, red indicates that the word should be deleted, and green indicates content that should be added. The resulting text error correction suggestions offer the various corrections for the user's reference.
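One way to turn a minimum edit distance into an editing operation sequence is to backtrack through the dynamic programming table. The sketch below is an assumption about how this could be done, not the application's stated procedure; the tuple layout `(operation, source_index, source_char, target_char)` and the function name are hypothetical, while the color mapping follows the optional scheme described above.

```python
ARROW_COLOR = {"keep": "black", "replace": "yellow", "delete": "red", "insert": "green"}


def edit_operations(source: str, target: str):
    """Return (op, source_index, source_char, target_char) tuples, in
    left-to-right order, that turn source into target with minimum edits."""
    m, n = len(source), len(target)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i
    for j in range(n + 1):
        edit[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if source[i - 1] == target[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1, edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + flag)
    # Walk back from the bottom-right cell, recording which term was chosen.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            flag = 0 if source[i - 1] == target[j - 1] else 1
            if edit[i][j] == edit[i - 1][j - 1] + flag:
                op = "keep" if flag == 0 else "replace"
                ops.append((op, i - 1, source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and edit[i][j] == edit[i - 1][j] + 1:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

Each tuple carries the editing position and operation type, and the list order gives the editing order; `ARROW_COLOR[op]` picks the arrow color for display.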
By implementing the above method, the text data to be corrected is preprocessed to obtain text information, which is input into the pre-trained text error correction model for text error correction processing to obtain the corresponding text error correction result; the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated according to the minimum edit distance algorithm; and the characters contained in the text information and those contained in the corresponding text error correction result are mapped according to the minimum edit distance to obtain text error correction suggestions. Deriving the suggestions from the minimum edit distance shows the relationship between the erroneous content and the correct content and gives the position of the erroneous content in the text, so that the user can make adjustments in real time. This solves the problem that existing error correction schemes cannot give the specific error position and error type during correction and cannot display the corrected content intuitively.
For ease of understanding, the specific flow of the embodiments of this application is described below. Referring to Figure 2, a second embodiment of the text error correction method in the embodiments of this application includes the following steps:
201. Obtain the data to be corrected and preprocess it to obtain text information;
In this embodiment, this step specifically removes the noise in the text data to obtain denoised text data; converts the denoised text data into a preset text format to obtain format-converted text data; and classifies and filters the format-converted text data according to preset categories and features to obtain the text information.
202. Input the text information into the encoder that uses the Transformer architecture for encoding, to obtain a text encoding;
In this embodiment, this step includes:
After the text information is input into the encoder that uses the Transformer architecture, it can be encoded using compression encoding, that is, an encoding scheme that reduces data size, such as BPE (byte pair encoding). BPE, also called digram coding, is mainly intended for data compression. It works as an iterative process in which, layer by layer, the most frequent pair of characters in a string is replaced by a character that does not occur in the string. For example, if a word in the initial text is "student", the characters "stu" can be replaced with the character "A" and "dent" with the character "B", so that the word "student" is encoded as "AB". In practical applications, compression encoding can be performed in units of words, phrases, sentences, and so on, to obtain the text encoding for the whole text information.
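The iterative pair replacement described above can be sketched as follows. This is a minimal illustration of byte-pair style compression, not the encoder's actual tokenizer; the placeholder symbols `<0>`, `<1>`, ... and the function names are assumptions.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair, new_symbol):
    """Replace every non-overlapping occurrence of pair with new_symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def bpe_compress(text, num_merges):
    """Run up to num_merges replacement rounds; return symbols and merge table."""
    symbols, table = list(text), {}
    for step in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        new_symbol = f"<{step}>"  # hypothetical symbol that cannot occur in the text
        table[new_symbol] = pair
        symbols = merge_pair(symbols, pair, new_symbol)
    return symbols, table
```

Repeated rounds merge already-merged symbols, which is how multi-character units such as "stu" or "dent" in the example above can end up as single symbols.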
Of course, an encoding method based on the attention mechanism can also be used. Specifically, the feature information in the text information is extracted and converted into feature vectors, and the feature vectors are encoded on the basis of the attention mechanism to obtain the text encoding for the whole text information.
203. Input the text encoding into the long short-term memory model for prediction, and perform text error correction based on the prediction result to obtain the text error correction result;
In this embodiment, this step includes:
Collecting text information that contains error correction information, building a training data set, training the long short-term memory model on the training data set, and generating an error correction vocabulary, where the error correction vocabulary contains the text encodings of erroneous sentences and the text encodings of the corresponding corrected text;
Obtaining the text encoding through the long short-term memory model, and calculating the perplexity of each part of the text encoding to obtain perplexity values;
Determining whether the perplexity value of each part of the text encoding is greater than a set threshold and, if so, replacing that part of the text encoding with the text encoding of the corresponding corrected text in the error correction vocabulary;
Calculating the perplexity of the replaced text encoding to obtain its perplexity value;
Comparing the perplexity values of the text encoding before and after replacement and, if the perplexity after replacement is larger, undoing the replacement;
Decoding the text encoding according to the decoding rules of the long short-term memory model to obtain the text error correction result.
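The replace-then-verify loop in the steps above can be sketched as follows. The `score` callable stands in for the LSTM perplexity computation and `corrections` for the error correction vocabulary; both are placeholders, and the toy scores in the usage lines are invented purely for illustration.

```python
def correct_by_perplexity(parts, score, corrections, threshold):
    """Replace high-perplexity parts with their vocabulary correction,
    keeping a replacement only if it does not raise the perplexity."""
    result = list(parts)
    for idx, part in enumerate(parts):
        before = score(part)
        if before <= threshold or part not in corrections:
            continue  # plausible enough, or no known correction
        candidate = corrections[part]
        if score(candidate) <= before:
            result[idx] = candidate  # keep the replacement
        # otherwise the replacement is undone (the original part is kept)
    return result


# Toy usage with made-up perplexity scores.
toy_scores = {"我门": 90.0, "我们": 10.0, "去": 5.0, "公圆": 80.0, "公元": 95.0}
toy_corrections = {"我门": "我们", "公圆": "公元"}
corrected = correct_by_perplexity(["我门", "去", "公圆"], toy_scores.__getitem__,
                                  toy_corrections, threshold=50.0)
```

In the toy run the first part is replaced, while the third is left unchanged because its candidate scores worse than the original.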
204. According to the minimum edit distance algorithm, calculate the minimum edit distance between the sentences contained in the text information and the sentences contained in the text error correction result;
In this embodiment, this step includes:
Obtaining the text information and the text error correction result, splitting each of them into sentences, and converting them into string sets, where the string sets comprise the string set corresponding to the sentences contained in the text information and the string set corresponding to the sentences contained in the text error correction result;
Constructing a dynamic programming equation according to the preset editing operation types, and performing the edit distance computation on the string sets to obtain the minimum edit distance between the strings, which represents the minimum edit distance between the sentences contained in the text information and the sentences contained in the text error correction result.
205. Map the sentences in the text information to the corresponding sentences in the text error correction result according to the minimum edit distance between the sentences, to obtain text error correction suggestions;
In this embodiment, this step includes:
Converting the minimum edit distance into an editing operation sequence according to the preset editing operation types. For example, when the preset editing operation types are deleting one character, inserting one character, and modifying one character, the minimum edit distance is converted into an editing operation sequence made up of character deletions, insertions, and modifications;
Outputting the text information together with its corresponding editing operation sequence in a preset output format to obtain the text error correction suggestions. For example, the editing operation sequence can be output interactively so that the user can apply corrections selectively. Specifically, arrows can connect the text to be corrected with the error correction result, with the two sentences mapped to each other on the basis of the minimum edits and arrows of different colors representing different editing operations. Optionally, a black arrow indicates that no processing is needed, a yellow arrow indicates that a modification is needed, red indicates that the word should be deleted, and green indicates content that should be added. The resulting text error correction suggestions offer the various corrections for the user's reference.
By implementing the above method, the text data to be corrected is preprocessed to obtain text information, which is input into the pre-trained text error correction model for text error correction processing to obtain the corresponding text error correction result; the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result is calculated according to the minimum edit distance algorithm; and the characters contained in the text information and those contained in the corresponding text error correction result are mapped according to the minimum edit distance to obtain text error correction suggestions. Deriving the suggestions from the minimum edit distance shows the relationship between the erroneous content and the correct content and gives the position of the erroneous content in the text, so that the user can make adjustments in real time. This solves the problem that existing error correction schemes cannot give the specific error position and error type during correction and cannot display the corrected content intuitively.
为便于理解,下面对本申请实施例的具体流程进行描述,请参阅图3,本申请实施例中文本纠错方法的第三个实施例,该方法的实现步骤如下:For ease of understanding, the specific process of the embodiment of the present application is described below. Please refer to Figure 3, which is the third embodiment of the text error correction method in the embodiment of the present application. The implementation steps of the method are as follows:
301、获取待纠错数据,并对待纠错数据进行预处理,得到文本信息;301. Obtain the data to be corrected and preprocess the data to be corrected to obtain text information;
对于该步骤,在本实施例中,具体地,包括:For this step, in this embodiment, it specifically includes:
对待纠错数据进行数据清洗,得到经过数据清洗后的文本数据;Perform data cleaning on the data to be corrected and obtain text data after data cleaning;
对经过数据清洗后的文本数据按照预设的文本类别进行分类,得到不同类别的文本信息;Classify the text data after data cleaning according to the preset text categories to obtain different categories of text information;
将文本信息输入至采用Transformer模型架构的编码器进行编码处理,得到文本编码;Input text information to the encoder using the Transformer model architecture for encoding processing to obtain text encoding;
对于该步骤,在本实施例中,包括:For this step, in this embodiment, it includes:
采用基于注意力(Attention)机制的编码方式,通过提取文本信息中的特征信息(Source)并转换成特征向量,并通过注意力机制表示为包含地址(Key)和值(Value)的数据对<Key,Value>,此时给定目标(Target)中的某个查询元素(Query),通过计算Query和各个Key的相关度(Similarity),得到每个Key对应Value的权重系数,然后对Value进行加权求和,得到Attention数值;在实际应用中,利用Attention机制,采用以下公式对Source中元素的Value值进行加权求和,其中,Query和Key用来计算对应Value的权重系数。Using an encoding method based on the attention (Attention) mechanism, the feature information (Source) in the text information is extracted and converted into a feature vector, and represented as a data pair containing an address (Key) and a value (Value) through the attention mechanism< Key, Value>, at this time, given a query element (Query) in the target (Target), by calculating the similarity (similarity) between Query and each Key, the weight coefficient of each Key corresponding to the Value is obtained, and then the Value is Weighted summation is used to obtain the Attention value; in practical applications, the Attention mechanism is used and the following formula is used to perform a weighted summation of the Value values of the elements in the Source, in which Query and Key are used to calculate the weight coefficient of the corresponding Value.
$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i$$
Here, Attention(Query, Source) is the Attention value computed when the attention mechanism maps the feature information (Source) onto a Query element of the target (Target); Similarity is the similarity between the Query and each Key; L_x is the length of Source; and the subscript i of Key and Value is the index of the corresponding <Key, Value> pair, over which the weighted sum runs.
Attention-based encoding is then performed according to the Attention values of the feature vectors to obtain the text encoding of the entire text information.
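The weighted sum described above can be sketched in a few lines. The source does not say how Similarity is computed or normalised; this sketch assumes a scaled dot product followed by a softmax, which is the usual choice in Transformer encoders, and the function names are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Return (attention value, weights) for one Query over all <Key, Value> pairs."""
    scale = math.sqrt(len(query))
    # Assumed similarity: scaled dot product between the Query and each Key.
    weights = softmax([dot(query, key) / scale for key in keys])
    dim = len(values[0])
    # Weighted sum of the Values, one output dimension at a time.
    output = [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)]
    return output, weights
```

When two Keys are equally similar to the Query, each Value receives half the weight, so the result is the plain average of the two Values.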
303. Input the text encoding into a decoder that uses the long short-term memory (LSTM) model framework for decoding to obtain a text error correction result;
In this embodiment, this step specifically includes:
Calling a perplexity calculation algorithm through the long short-term memory model to compute the perplexity of the text encoding, using the following formula:
$$PP(W) = P(\omega_1 \omega_2 \ldots \omega_N)^{-\frac{1}{N}}$$
PP(W) is the perplexity of sentence W; ω_1, ω_2, ..., ω_N are the text encodings of the words contained in sentence W, where the subscript N of ω_N denotes the range of word encodings selected in the current iteration; and P(ω_1 ω_2 … ω_N) is the probability of the sentence.
Compare the perplexity of each sentence in the text to be corrected with a preset perplexity threshold;
If the perplexity of a sentence is less than the preset perplexity threshold, determine that the sentence does not need correction;
If the perplexity of a sentence is greater than or equal to the preset perplexity threshold, determine that the sentence needs correction;
For a sentence that needs correction, perform probability prediction on its text encoding through the long short-term memory model and replace the encoding with the predicted text encoding to obtain a probability prediction result;
Based on the probability prediction result, decode the text encoding into text to obtain the text error correction result.
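The perplexity test in step 303 can be illustrated as follows. This is a sketch under the assumption that the language model factorises the sentence probability into per-token conditional probabilities (the chain rule), so that PP(W) is the inverse geometric mean of those probabilities; the function names and the threshold values are hypothetical.

```python
import math

def perplexity(token_probs):
    """PP(W) = P(w_1 ... w_N) ** (-1/N), computed in log space for stability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

def needs_correction(token_probs, threshold):
    """A sentence is flagged when its perplexity reaches the preset threshold."""
    return perplexity(token_probs) >= threshold
```

For instance, a four-token sentence whose tokens each have probability 0.25 has a perplexity of about 4, so it would be flagged under a threshold of 3 but not under a threshold of 5.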
304. Using the minimum edit distance algorithm, calculate the minimum edit distance between the sentences contained in the text information and the sentences contained in the text error correction result;
In this embodiment, this step includes:
Obtaining the text information and the text error correction result, splitting each into sentences, and converting the sentences into string sets, namely a string set for the sentences of the text information and a string set for the sentences of the text error correction result;
Constructing a dynamic programming equation according to the preset edit operation types and running the edit distance computation on the string sets to obtain the minimum edit distance between strings, which represents the minimum edit distance between a sentence in the text information and the corresponding sentence in the text error correction result.
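The dynamic programming computation in step 304 can be sketched as below. The sketch assumes the three edit operation types named elsewhere in this description (delete, insert, modify one character), each with unit cost; the source does not state the costs, and the function name is hypothetical.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit-cost delete, insert, and modify operations."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i            # delete every character of the source prefix
    for j in range(cols):
        dist[0][j] = j            # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete source[i - 1]
                dist[i][j - 1] + 1,         # insert target[j - 1]
                dist[i - 1][j - 1] + cost,  # keep or modify
            )
    return dist[-1][-1]
```

Applied to an original sentence and its corrected counterpart, a distance of 1 indicates a single-character correction.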
305. Map each sentence in the text information to the corresponding sentence in the text error correction result according to the minimum edit distance between the sentences to obtain text error correction opinions;
In this embodiment, this step includes:
Converting the minimum edit distance into an edit operation sequence according to the preset edit operation types. For example, when the preset edit operation types are deleting a character, inserting a character, and modifying a character, the minimum edit distance is converted into an edit operation sequence made up of those delete, insert, and modify operations;
Outputting the text information together with its edit operation sequence in a preset output mode to obtain the text error correction opinions. For example, the edit operation sequence may be output interactively so that the user can apply corrections selectively. Specifically, the text to be corrected may be used as the basis for presentation: the corrected content is shown in a different color or font in the user interface, and the corresponding edit operation sequence is output as links or arrows;
In practice, the text error correction result may instead be used as the basis: the places where it differs from the text to be corrected are highlighted in the user interface, and the corresponding edit operation sequence is output as links or arrows to obtain the text error correction opinions.
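One way to turn the minimum edit distance into an edit operation sequence is to backtrace through the dynamic programming table. The following is a minimal sketch, assuming unit-cost operations and a hypothetical tuple format `(operation, position in the original text, old character, new character)`; the source does not define the representation of an operation.

```python
def edit_operations(source: str, target: str):
    """Return the edit operations that turn source into target, in text order."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # Walk back from the bottom-right cell, recording each non-matching step.
    ops = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("modify", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

Because every operation carries its position in the original text, a user interface could use the sequence directly to highlight the affected characters and attach the suggested change to each one.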
By implementing the above method, the text data to be corrected is preprocessed to obtain text information, which is input into a pre-trained text error correction model for correction to obtain the corresponding text error correction result. The minimum edit distance between the characters in the text information and the characters in the corresponding text error correction result is computed with the minimum edit distance algorithm, and the two are mapped onto each other according to that distance to obtain text error correction opinions. Deriving the opinions from the minimum edit distance shows the relationship between the erroneous content and the correct content and gives the position of the erroneous content in the text, so that the user can make adjustments in real time. This solves the problem that existing error correction schemes cannot give the specific error position and error type during correction and cannot display the corrected content intuitively.
The text error correction method of the embodiments of the present application has been described above; the text error correction apparatus is described below. Referring to Figure 4, one embodiment of the text error correction apparatus includes:
Preprocessing module 401, configured to obtain the data to be corrected and preprocess it to obtain text information;
Text error correction processing module 402, configured to input the text information into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information;
Minimum edit distance calculation module 403, configured to calculate, using the minimum edit distance algorithm, the minimum edit distance between the strings contained in the text information and the strings contained in the text error correction result;
Mapping processing module 404, configured to map the strings contained in the text information to the strings contained in the text error correction result according to the minimum edit distance to obtain text error correction opinions.
By implementing the above apparatus, the text data to be corrected is preprocessed to obtain text information, which is input into a pre-trained text error correction model for correction to obtain the corresponding text error correction result. The minimum edit distance between the characters in the text information and the characters in the corresponding text error correction result is computed with the minimum edit distance algorithm, and the two are mapped onto each other according to that distance to obtain text error correction opinions. Deriving the opinions from the minimum edit distance shows the relationship between the erroneous content and the correct content and gives the position of the erroneous content in the text, so that the user can make adjustments in real time. This solves the problem that existing error correction schemes cannot give the specific error position and error type during correction and cannot display the corrected content intuitively.
Referring to Figure 5, another embodiment of the text error correction apparatus includes:
Preprocessing module 401, configured to obtain the data to be corrected and preprocess it to obtain text information;
Text error correction module 402, configured to input the text information into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information;
Minimum edit distance calculation module 403, configured to calculate, using the minimum edit distance algorithm, the minimum edit distance between the strings contained in the text information and the strings contained in the text error correction result;
Mapping processing module 404, configured to map the strings contained in the text information to the strings contained in the text error correction result according to the minimum edit distance to obtain text error correction opinions;
Model training module 405, configured to: extract the encoder of the Transformer model framework and the decoder of the long short-term memory model framework; share embedding layer parameters between the encoder and the decoder and factorize the embedding layer parameters to construct a hybrid architecture model; construct a training data set from text data carrying error correction information and train the hybrid architecture model on the training data set to obtain an initial training model; and quantize the initial training model to obtain the text error correction model.
The text error correction module 402 includes:
Text encoding unit 4021, configured to encode the text information with the encoder that uses the Transformer model architecture to obtain a text encoding;
First calculation unit 4022, configured to apply linear transformation and projection to the text encoding according to the attention mechanism and calculate the attention values corresponding to the text encoding;
Second calculation unit 4023, configured to call the perplexity calculation algorithm and iterate over the attention values contained in the attention value set to obtain the corresponding perplexity;
Probability prediction unit 4024, configured to perform probability prediction on the text encoding according to the perplexity to obtain a probability prediction result;
Text decoding unit 4025, configured to decode the text encoding according to the probability prediction result to obtain the text error correction result;
In this embodiment, the minimum edit distance calculation module 403 includes:
Character conversion unit 4031, configured to extract all characters of the text information and of the corresponding text error correction result to form a character set; split the character set into strings according to a preset splitting method; and, according to the correspondence between the text information and the text error correction result, convert the strings into a character matrix that preserves that correspondence;
Dynamic programming unit 4032, configured to construct a dynamic programming equation according to the preset edit operation types;
Third calculation unit 4033, configured to run the edit distance computation on the character feature values in the character matrix to obtain the minimum edit distance between the strings contained in the text information and the strings contained in the text error correction result;
In this embodiment, the mapping processing module 404 includes:
Mapping unit 4041, configured to map the strings contained in the text information to the strings contained in the text error correction result according to the minimum edit distance to obtain string correspondence groups;
Sequence generation unit 4042, configured to construct an edit operation sequence according to the preset edit operation types and the minimum edit distance between the strings in each string correspondence group, in the direction that edits the string from the text information into the string from the text error correction result;
Opinion output unit 4043, configured to output, in a preset output mode, the text information together with the edit operation sequences corresponding to the strings it contains to obtain the text error correction opinions;
In this embodiment, the model training module 405 includes:
Training data set generation unit 4051, configured to collect text data and construct the training data set in a preset manner;
Training unit 4052, configured to feed the training data set into the hybrid architecture model repeatedly in a hard distillation loop, obtain the corresponding training result from the encoding and decoding operations of the model being trained, and determine whether the training result satisfies a preset condition; if so, the loop terminates and the initial training model is output.
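The description of model training module 405 says that the embedding layer parameters are shared and factorized, but gives no formula. One common reading, assumed here, is a low-rank factorization in which a V x H embedding matrix is replaced by a V x E matrix followed by an E x H projection with E much smaller than H. The sketch below only counts parameters under that assumption; the function name and the sizes in the usage note are hypothetical.

```python
def embedding_params(vocab_size: int, hidden_size: int, bottleneck=None) -> int:
    """Parameter count of an embedding layer, optionally factorized.

    Without a bottleneck the layer is a single vocab_size x hidden_size matrix.
    With a bottleneck E it is a vocab_size x E matrix plus an E x hidden_size
    projection (an assumed low-rank factorization).
    """
    if bottleneck is None:
        return vocab_size * hidden_size
    return vocab_size * bottleneck + bottleneck * hidden_size
```

With a vocabulary of 30,000 entries and a hidden size of 768, the unfactorized layer has 23,040,000 parameters, while a bottleneck of 128 reduces this to 3,938,304. Sharing the same embedding between encoder and decoder means this cost is paid once rather than twice.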
By implementing the above apparatus, the text data to be corrected is preprocessed to obtain text information, which is input into a pre-trained text error correction model for correction to obtain the corresponding text error correction result. The minimum edit distance between the characters in the text information and the characters in the corresponding text error correction result is computed with the minimum edit distance algorithm, and the two are mapped onto each other according to that distance to obtain text error correction opinions. Deriving the opinions from the minimum edit distance shows the relationship between the erroneous content and the correct content and gives the position of the erroneous content in the text, so that the user can make adjustments in real time. This solves the problem that existing error correction schemes cannot give the specific error position and error type during correction and cannot display the corrected content intuitively.
Referring to Figure 6, an embodiment of the computer device of the present application is described in detail below from the perspective of hardware processing.
Figure 6 is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device 600 may vary considerably with configuration or performance, and may include one or more central processing units (CPU) 610 (for example, one or more processors), a memory 620, and one or more storage media 630 (for example, one or more mass storage devices) storing application programs 633 or data 632. The memory 620 and the storage medium 630 may provide transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the computer device 600. Further, the processor 610 may be configured to communicate with the storage medium 630 and to execute, on the computer device 600, the series of instruction operations in the storage medium 630.
The computer device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art will understand that the structure shown in Figure 6 does not limit the computer device provided by the present application, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the following steps: obtaining text data to be corrected and preprocessing it to obtain text information; inputting the text information into a pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information, where the text error correction model is a sequence-to-sequence model with a hybrid architecture whose encoder uses the Transformer model architecture and whose decoder uses the long short-term memory model architecture; calculating, using the minimum edit distance algorithm, the minimum edit distance between the characters contained in the text information and the characters contained in the corresponding text error correction result; and mapping the characters contained in the text information to the characters contained in the corresponding text error correction result according to the minimum edit distance to obtain text error correction opinions.
In practice, the method provided above may be implemented with artificial intelligence technology. Artificial intelligence (AI) comprises the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. The method may be executed on a server, which may be an independent server or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the apparatus and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A text error correction method, wherein the text error correction method comprises:
    obtaining text data to be corrected, and preprocessing the text data to be corrected to obtain text information;
    inputting the text information into a pre-trained text error correction model for text error correction processing to obtain a text error correction result corresponding to the text information, wherein the encoder of the text error correction model uses the Transformer model architecture and the decoder of the text error correction model uses the long short-term memory model architecture;
    calculating, according to the minimum edit distance algorithm, the minimum edit distance between the strings contained in the text information and the strings contained in the text error correction result;
    mapping the strings contained in the text information to the strings contained in the text error correction result according to the minimum edit distance to obtain text error correction opinions.
  2. The text error correction method according to claim 1, wherein inputting the text information into the pre-trained text error correction model for text error correction processing to obtain the text error correction result corresponding to the text information comprises:
    inputting the text information into the pre-trained text error correction model, and encoding the text information with the encoder that uses the Transformer model architecture to obtain a text encoding;
    applying linear transformation and projection to the text encoding according to the attention mechanism, and calculating the attention values corresponding to the text encoding;
    concatenating and combining the attention values in the manner preset in the text error correction model to obtain an attention value set;
    calling a perplexity calculation algorithm through the long short-term memory model, and iterating over the attention values contained in the attention value set to obtain the corresponding perplexity;
    performing, through the long short-term memory model, probability prediction on the text encoding according to the perplexity to obtain a probability prediction result;
    decoding, through the long short-term memory model, the text encoding based on the probability prediction result to obtain the text error correction result.
  3. The text error correction method according to claim 1, wherein calculating, according to the minimum edit distance algorithm, the minimum edit distance between the strings contained in the text information and the strings contained in the text error correction result comprises:
    extracting all characters of the text information and of the corresponding text error correction result to form a character set;
    splitting the character set according to a preset splitting method to form strings;
    converting, according to the correspondence between the text information and the text error correction result, the strings into a character matrix that preserves the correspondence, wherein the character matrix contains the character feature values of all characters in the strings;
    constructing a dynamic programming equation according to the preset edit operation types;
    performing, based on the dynamic programming equation, an edit distance computation on the character feature values in the character matrix to obtain the minimum edit distance between the strings contained in the text information and the strings contained in the text error correction result.
  4. The text error correction method according to claim 1, wherein mapping the character strings contained in the text information to the character strings contained in the text error correction result according to the minimum edit distance, to obtain text error correction suggestions, comprises:
    mapping, according to the minimum edit distance, the character strings contained in the text information to the character strings contained in the text error correction result to obtain string correspondence groups, wherein each string correspondence group includes one character string from the text information and one character string from the text error correction result;
    constructing an edit operation sequence according to the preset edit operation types and the minimum edit distance between the character strings in each string correspondence group, in the edit direction that turns the character string from the text information into the character string from the text error correction result;
    outputting, in a preset output manner, the text information together with the edit operation sequences corresponding to the character strings it contains, to obtain the text error correction suggestions.
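One way to build such an edit operation sequence is to fill the edit distance table and then trace back from its last cell. The sketch below assumes unit-cost insert, delete, and substitute operations and a tuple layout `(operation, source_index, old_char, new_char)`; both the layout and the name `edit_operations` are illustrative assumptions, since the claim leaves the operation types and output format as "preset".

```python
def edit_operations(source: str, target: str) -> list:
    """Return the operations that turn source into target, in left-to-right order."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # Trace back from the bottom-right cell to recover one optimal edit script.
    ops = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, nothing to record
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

With the source string "我喜欢吃平果" and the corrected string "我喜欢吃苹果", the function returns a single operation, `("substitute", 4, "平", "苹")`, which could then be rendered as a correction suggestion.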
  5. The text error correction method according to any one of claims 1 to 4, wherein, before acquiring the text data to be corrected and preprocessing the text data to be corrected to obtain the text information, the method further comprises:
    extracting the encoder of a Transformer model framework and the decoder of a long short-term memory (LSTM) model framework;
    sharing embedding layer parameters between the encoder and the decoder, and factorizing the embedding layer parameters, to construct a hybrid architecture model;
    constructing a training data set from text data carrying error correction information, and training the hybrid architecture model on the training data set to obtain an initially trained model;
    quantizing the initially trained model to obtain the text error correction model.
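The effect of factorizing the embedding layer can be seen from a parameter count. The sketch below assumes a low-rank factorization in which a single V x H embedding matrix is replaced by a V x E matrix followed by an E x H projection; the claim does not say which factorization is used, and the sizes in the usage note are illustrative values, not figures from the patent.

```python
def embedding_params(vocab_size: int, hidden_size: int, factor_size=None) -> int:
    """Count embedding parameters, optionally with a low-rank factorization."""
    if factor_size is None:
        # Unfactorized: one V x H matrix.
        return vocab_size * hidden_size
    # Factorized (assumed form): V x E lookup followed by an E x H projection.
    return vocab_size * factor_size + factor_size * hidden_size
```

With an assumed vocabulary of 21128 tokens, hidden size 768, and factor size 128, the count drops from 16,226,304 to 2,802,688 parameters. Sharing the same embedding between encoder and decoder avoids storing a second copy.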
  6. The text error correction method according to claim 5, wherein constructing a training data set from text data carrying error correction information, and training the hybrid architecture model on the training data set to obtain an initially trained model, comprises:
    collecting text data and constructing the training data set in a preset manner;
    feeding the training data set cyclically into the hybrid architecture model in a hard distillation loop, and obtaining corresponding training results through the encoding and decoding operations of the model being trained;
    determining whether the training results satisfy a preset condition;
    if so, terminating the loop and outputting the initially trained model.
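The loop in this claim can be sketched as follows, under two assumptions that the claim itself does not state: "hard distillation" is read as training the student on a teacher model's argmax (hard) labels, and the "preset condition" is read as a loss threshold. All function names and the `step_fn` callback are hypothetical.

```python
import math

def hard_labels(teacher_probs):
    """Turn each teacher probability vector into its argmax class index."""
    return [max(range(len(p)), key=p.__getitem__) for p in teacher_probs]

def cross_entropy(student_probs, labels):
    """Mean negative log-likelihood of the hard labels under the student."""
    return -sum(math.log(p[y]) for p, y in zip(student_probs, labels)) / len(labels)

def train_until(step_fn, loss_threshold, max_rounds):
    """Run training rounds until the loss meets the threshold (assumed condition).

    step_fn is a caller-supplied function that performs one pass over the
    training set and returns the resulting loss; max_rounds must be >= 1.
    """
    for round_index in range(1, max_rounds + 1):
        loss = step_fn()
        if loss <= loss_threshold:
            break  # preset condition satisfied: terminate the loop
    return round_index, loss
```

In a real system `step_fn` would run the encoder and decoder over a batch, compute `cross_entropy` against `hard_labels` of the teacher output, and update the weights.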
  7. A computer device, comprising a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected by a line;
    the at least one processor invokes the instructions in the memory to cause the computer device to perform the following steps:
    acquiring text data to be corrected, and preprocessing the text data to be corrected to obtain text information;
    inputting the text information into a pre-trained text error correction model for text error correction processing, to obtain a text error correction result corresponding to the text information, wherein the encoder of the text error correction model has a Transformer model architecture and the decoder of the text error correction model has a long short-term memory (LSTM) model architecture;
    calculating, according to a minimum edit distance algorithm, the minimum edit distance between the character strings contained in the text information and the character strings contained in the text error correction result;
    mapping the character strings contained in the text information to the character strings contained in the text error correction result according to the minimum edit distance, to obtain text error correction suggestions.
  8. The computer device according to claim 7, wherein, when the processor executes the instructions to input the text information into the pre-trained text error correction model for text error correction processing and obtain the text error correction result corresponding to the text information, the following steps are performed:
    inputting the text information into the pre-trained text error correction model, and encoding the text information with the encoder having the Transformer model architecture to obtain a text encoding;
    performing linear transformation and projection on the text encoding according to an attention mechanism, and calculating the attention values corresponding to the text encoding;
    concatenating and combining the attention values in the manner preset in the text error correction model to obtain an attention value set;
    invoking, through the long short-term memory model, a perplexity calculation algorithm to iteratively process the attention values contained in the attention value set, to obtain a corresponding perplexity;
    performing, through the long short-term memory model, probability prediction on the text encoding according to the perplexity, to obtain a probability prediction result;
    decoding, through the long short-term memory model, the text encoding based on the probability prediction result, to obtain the text error correction result.
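The attention value computation in this claim is not spelled out. As an illustration only, the sketch below shows standard scaled dot-product attention for a single head in plain Python. It assumes the queries, keys, and values have already been produced by the learned linear projections, which are omitted here; in a multi-head Transformer encoder, the per-head outputs would then be concatenated, which is one plausible reading of the "concatenating and combining" step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output column at a time.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

When all keys are identical the weights are uniform, so each output row is simply the mean of the value vectors.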
  9. The computer device according to claim 7, wherein, when the processor executes the instructions to calculate, according to the minimum edit distance algorithm, the minimum edit distance between the character strings contained in the text information and the character strings contained in the text error correction result, the following steps are performed:
    extracting all characters from the text information and the corresponding text error correction result to form a character set;
    splitting the character set according to a preset splitting method to form character strings;
    converting, according to the correspondence between the text information and the text error correction result, the character strings into a character matrix having that correspondence, wherein the character matrix contains the character feature values of all characters in the character strings;
    constructing a dynamic programming equation according to preset edit operation types;
    performing an edit distance operation on the character feature values in the character matrix based on the dynamic programming equation, to obtain the minimum edit distance between the character strings contained in the text information and the character strings contained in the text error correction result.
  10. The computer device according to claim 7, wherein, when the processor executes the instructions to map the character strings contained in the text information to the character strings contained in the text error correction result according to the minimum edit distance and obtain the text error correction suggestions, the following steps are performed:
    mapping, according to the minimum edit distance, the character strings contained in the text information to the character strings contained in the text error correction result to obtain string correspondence groups, wherein each string correspondence group includes one character string from the text information and one character string from the text error correction result;
    constructing an edit operation sequence according to the preset edit operation types and the minimum edit distance between the character strings in each string correspondence group, in the edit direction that turns the character string from the text information into the character string from the text error correction result;
    outputting, in a preset output manner, the text information together with the edit operation sequences corresponding to the character strings it contains, to obtain the text error correction suggestions.
  11. The computer device according to any one of claims 7 to 10, wherein, before acquiring the text data to be corrected and preprocessing the text data to be corrected to obtain the text information, the instructions executed by the processor further implement:
    extracting the encoder of a Transformer model framework and the decoder of a long short-term memory (LSTM) model framework;
    sharing embedding layer parameters between the encoder and the decoder, and factorizing the embedding layer parameters, to construct a hybrid architecture model;
    constructing a training data set from text data carrying error correction information, and training the hybrid architecture model on the training data set to obtain an initially trained model;
    quantizing the initially trained model to obtain the text error correction model.
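The claim does not say which quantization scheme is applied to the initially trained model. As one hedged illustration, the sketch below performs symmetric per-tensor int8 quantization of a flat list of weights; the scheme, the function names, and the 127-level range are all assumptions made for the example.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: returns (integer codes, scale)."""
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is at most half the scale.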
  12. The computer device according to claim 11, wherein, when the processor executes the instructions to construct a training data set from text data carrying error correction information and train the hybrid architecture model on the training data set to obtain an initially trained model, the following steps are performed:
    collecting text data and constructing the training data set in a preset manner;
    feeding the training data set cyclically into the hybrid architecture model in a hard distillation loop, and obtaining corresponding training results through the encoding and decoding operations of the model being trained;
    determining whether the training results satisfy a preset condition;
    if so, terminating the loop and outputting the initially trained model.
  13. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the following steps:
    acquiring text data to be corrected, and preprocessing the text data to be corrected to obtain text information;
    inputting the text information into a pre-trained text error correction model for text error correction processing, to obtain a text error correction result corresponding to the text information, wherein the encoder of the text error correction model has a Transformer model architecture and the decoder of the text error correction model has a long short-term memory (LSTM) model architecture;
    calculating, according to a minimum edit distance algorithm, the minimum edit distance between the character strings contained in the text information and the character strings contained in the text error correction result;
    mapping the character strings contained in the text information to the character strings contained in the text error correction result according to the minimum edit distance, to obtain text error correction suggestions.
  14. The computer-readable storage medium according to claim 13, wherein, when the processor executes the computer program to input the text information into the pre-trained text error correction model for text error correction processing and obtain the text error correction result corresponding to the text information, the following steps are performed:
    inputting the text information into the pre-trained text error correction model, and encoding the text information with the encoder having the Transformer model architecture to obtain a text encoding;
    performing linear transformation and projection on the text encoding according to an attention mechanism, and calculating the attention values corresponding to the text encoding;
    concatenating and combining the attention values in the manner preset in the text error correction model to obtain an attention value set;
    invoking, through the long short-term memory model, a perplexity calculation algorithm to iteratively process the attention values contained in the attention value set, to obtain a corresponding perplexity;
    performing, through the long short-term memory model, probability prediction on the text encoding according to the perplexity, to obtain a probability prediction result;
    decoding, through the long short-term memory model, the text encoding based on the probability prediction result, to obtain the text error correction result.
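The claim names a perplexity calculation but does not define it. The sketch below uses the standard language-model definition, the exponential of the mean negative log-probability assigned to each token in a sequence; how the patent derives these probabilities from the attention values is not stated, so the `token_probs` input is an assumption.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability assigned to each token."""
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)
```

A model that assigns probability 0.25 to every token has a perplexity of 4; lower values indicate that the candidate sequence is more plausible to the model.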
  15. The computer-readable storage medium according to claim 13, wherein, when the processor executes the computer program to calculate, according to the minimum edit distance algorithm, the minimum edit distance between the character strings contained in the text information and the character strings contained in the text error correction result, the following steps are performed:
    extracting all characters from the text information and the corresponding text error correction result to form a character set;
    splitting the character set according to a preset splitting method to form character strings;
    converting, according to the correspondence between the text information and the text error correction result, the character strings into a character matrix having that correspondence, wherein the character matrix contains the character feature values of all characters in the character strings;
    constructing a dynamic programming equation according to preset edit operation types;
    performing an edit distance operation on the character feature values in the character matrix based on the dynamic programming equation, to obtain the minimum edit distance between the character strings contained in the text information and the character strings contained in the text error correction result.
  16. The computer-readable storage medium according to claim 13, wherein, when the processor executes the computer program to map the character strings contained in the text information to the character strings contained in the text error correction result according to the minimum edit distance and obtain the text error correction suggestions, the following steps are performed:
    mapping, according to the minimum edit distance, the character strings contained in the text information to the character strings contained in the text error correction result to obtain string correspondence groups, wherein each string correspondence group includes one character string from the text information and one character string from the text error correction result;
    constructing an edit operation sequence according to the preset edit operation types and the minimum edit distance between the character strings in each string correspondence group, in the edit direction that turns the character string from the text information into the character string from the text error correction result;
    outputting, in a preset output manner, the text information together with the edit operation sequences corresponding to the character strings it contains, to obtain the text error correction suggestions.
  17. The computer-readable storage medium according to any one of claims 13 to 16, wherein, before acquiring the text data to be corrected and preprocessing the text data to be corrected to obtain the text information, the computer program executed by the processor further implements:
    extracting the encoder of a Transformer model framework and the decoder of a long short-term memory (LSTM) model framework;
    sharing embedding layer parameters between the encoder and the decoder, and factorizing the embedding layer parameters, to construct a hybrid architecture model;
    constructing a training data set from text data carrying error correction information, and training the hybrid architecture model on the training data set to obtain an initially trained model;
    quantizing the initially trained model to obtain the text error correction model.
  18. The computer-readable storage medium according to claim 17, wherein, when the processor executes the computer program to construct a training data set from text data carrying error correction information and train the hybrid architecture model on the training data set to obtain an initially trained model, the following steps are performed:
    collecting text data and constructing the training data set in a preset manner;
    feeding the training data set cyclically into the hybrid architecture model in a hard distillation loop, and obtaining corresponding training results through the encoding and decoding operations of the model being trained;
    determining whether the training results satisfy a preset condition;
    if so, terminating the loop and outputting the initially trained model.
  19. A text error correction processing apparatus, comprising:
    a preprocessing module, configured to acquire text data to be corrected and preprocess the text data to be corrected to obtain text information;
    a text error correction processing module, configured to input the text information into a pre-trained text error correction model for text error correction processing, to obtain a text error correction result corresponding to the text information;
    a minimum edit distance calculation module, configured to calculate the minimum edit distance between the text information and the corresponding text error correction result;
    a mapping processing module, configured to map the text information to the corresponding text error correction result according to the minimum edit distance, to obtain text error correction suggestions.
  20. The text error correction processing apparatus according to claim 19, further comprising a model training module configured to:
    extract the encoder of a Transformer model framework and the decoder of a long short-term memory (LSTM) model framework;
    share embedding layer parameters between the encoder and the decoder, and factorize the embedding layer parameters, to construct a hybrid architecture model;
    construct a training data set from text data carrying error correction information, and train the hybrid architecture model on the training data set to obtain an initially trained model;
    quantize the initially trained model to obtain the text error correction model.
PCT/CN2022/089175 2022-03-17 2022-04-26 Text error correction method and apparatus, device, and storage medium WO2023173533A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210262506.4 2022-03-17
CN202210262506.4A CN114611494B (en) 2022-03-17 2022-03-17 Text error correction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023173533A1 (en) 2023-09-21

Family

ID=81862921

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089175 WO2023173533A1 (en) 2022-03-17 2022-04-26 Text error correction method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114611494B (en)
WO (1) WO2023173533A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634473A (en) * 2023-12-12 2024-03-01 郑州大学 Grammar error correction early-stop multi-round decoding method and system integrated with source information
CN117744787A (en) * 2024-02-20 2024-03-22 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN118052627A (en) * 2024-04-15 2024-05-17 辽宁省网联数字科技产业有限公司 Intelligent filling method and system for bidding scheme

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293139B (en) * 2022-08-03 2023-06-09 北京中科智加科技有限公司 Training method of speech transcription text error correction model and computer equipment
CN115204151A (en) * 2022-09-15 2022-10-18 华东交通大学 Chinese text error correction method, system and readable storage medium
CN116468577B (en) * 2023-03-20 2024-03-08 中慧云启科技集团有限公司 Teaching practical training management system based on B/S architecture
CN116311310A (en) * 2023-05-19 2023-06-23 之江实验室 Universal form identification method and device combining semantic segmentation and sequence prediction
CN116757184B (en) * 2023-08-18 2023-10-20 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464736A (en) * 2014-12-15 2015-03-25 北京百度网讯科技有限公司 Error correction method and device for voice recognition text
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
US11017167B1 (en) * 2018-06-29 2021-05-25 Intuit Inc. Misspelling correction based on deep learning architecture
CN113297833A (en) * 2020-02-21 2021-08-24 华为技术有限公司 Text error correction method and device, terminal equipment and computer storage medium
CN113836935A (en) * 2021-09-09 2021-12-24 海信视像科技股份有限公司 Server and text error correction method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10878335B1 (en) * 2016-06-14 2020-12-29 Amazon Technologies, Inc. Scalable text analysis using probabilistic data structures
CN109948152B (en) * 2019-03-06 2020-07-17 北京工商大学 LSTM-based Chinese text grammar error correction model method
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113935317A (en) * 2021-09-26 2022-01-14 平安科技(深圳)有限公司 Text error correction method and device, electronic equipment and storage medium
CN114154486A (en) * 2021-11-09 2022-03-08 浙江大学 Intelligent error correction system for Chinese corpus spelling errors

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634473A (en) * 2023-12-12 2024-03-01 郑州大学 Grammar error correction early-stop multi-round decoding method and system integrated with source information
CN117744787A (en) * 2024-02-20 2024-03-22 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN117744787B (en) * 2024-02-20 2024-05-07 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN118052627A (en) * 2024-04-15 2024-05-17 辽宁省网联数字科技产业有限公司 Intelligent filling method and system for bidding scheme

Also Published As

Publication number Publication date
CN114611494B (en) 2024-02-02
CN114611494A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
WO2023173533A1 (en) Text error correction method and apparatus, device, and storage medium
WO2022198868A1 (en) Open entity relationship extraction method, apparatus and device, and storage medium
CN107357789B (en) Neural machine translation method fusing multi-language coding information
CN111651557B (en) Automatic text generation method and device and computer readable storage medium
CN108153913B (en) Training method of reply information generation model, reply information generation method and device
CN108170686B (en) Text translation method and device
CN112765345A (en) Text abstract automatic generation method and system fusing pre-training model
CN116737759B (en) Method for generating SQL sentence by Chinese query based on relation perception attention
CN113254616B (en) Intelligent question-answering system-oriented sentence vector generation method and system
CN113051894A (en) Text error correction method and device
CN115374270A (en) Legal text abstract generation method based on graph neural network
CN112417138A (en) Short text automatic summarization method combining pointer generation type and self-attention mechanism
CN112668346A (en) Translation method, device, equipment and storage medium
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN111680529A (en) Machine translation algorithm and device based on layer aggregation
CN114297220A (en) Data processing method and device, computer equipment and storage medium
CN109979461A (en) A kind of voice translation method and device
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN114625759B (en) Model training method, intelligent question-answering method, device, medium and program product
CN112989845B (en) Chapter-level neural machine translation method and system based on routing algorithm
CN115034236A (en) Chinese-English machine translation method based on knowledge distillation
CN114254657A (en) Translation method and related equipment thereof
CN114239548A (en) Triple extraction method for merging dependency syntax and pointer generation network
CN111126047B (en) Method and device for generating synonymous text
CN116451699A (en) Segment extraction type machine reading and understanding method based on attention mechanism

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22931560
Country of ref document: EP
Kind code of ref document: A1