CN113343678A - Text error correction method and device, electronic equipment and storage medium - Google Patents

Publication number
CN113343678A
Authority
CN
China
Prior art keywords
text
corrected
error correction
content
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110711749.7A
Other languages
Chinese (zh)
Inventor
詹明捷
梁鼎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202110711749.7A
Publication of CN113343678A
Priority to PCT/CN2021/134638 (WO2022267353A1)

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/205 Parsing > G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities > G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a text error correction method, apparatus, electronic device and storage medium. The method comprises: acquiring text content to be corrected; and performing, based on a trained text error correction network, multi-dimensional text error correction covering a pronunciation dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content. The text error correction network is trained on generated error sentence samples, and each error sentence sample is obtained by destroying a correct sentence sample using preset pronunciation-similar characters and preset shape-similar characters. The text error correction network can thus learn the conversion relation between error sentences and correct sentences, which guides fast error correction of the text content to be corrected and yields high error correction efficiency.

Description

Text error correction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to a method and an apparatus for text error correction, an electronic device, and a storage medium.
Background
With the continuous development of science and technology, character recognition technology, especially Optical Character Recognition (OCR), is increasingly widely applied. OCR can recognize text content from an image. However, the recognized text content may contain wrong characters due to various influences, such as the written font and the external environment.
Text error correction is the process of correcting wrong characters in a text. In the related art, the correction is performed manually, which consumes a great deal of the relevant personnel's time and results in low error correction efficiency.
Disclosure of Invention
The embodiment of the disclosure at least provides a text error correction method, a text error correction device, an electronic device and a storage medium, so as to improve error correction efficiency.
In a first aspect, an embodiment of the present disclosure provides a method for text error correction, where the method includes:
acquiring text content to be corrected;
performing multi-dimensional text error correction including a pronunciation dimension and a glyph dimension on the text content to be corrected based on the trained text error correction network to obtain corrected text content;
the text error correction network is obtained by training based on generated error sentence samples, and the error sentence samples are obtained by destroying correct sentence samples based on preset character-pronunciation similar characters and character-shape similar characters.
With the above text error correction method, once the text content to be corrected is acquired, multi-dimensional text error correction can be performed on it based on the trained text error correction network to obtain the corrected text content. Because the text error correction network is trained on error sentence samples obtained by destroying correct sentence samples with preset pronunciation-similar characters and preset shape-similar characters, it can learn the conversion relation between error sentences and correct sentences, which in turn guides fast correction of the text content to be corrected and gives higher error correction efficiency.
In one possible embodiment, the text correction network is trained as follows:
acquiring a correct sentence sample and an error sentence sample obtained by text destruction of the correct sentence sample, where the error sentence sample differs from the correct sentence sample in at least one character;
and taking the error sentence sample as input data of the text error correction network to be trained to obtain an output result, taking the correct sentence sample corresponding to the error sentence sample as the label of the error sentence sample, and performing at least one round of training on the text error correction network to be trained to obtain the trained text error correction network.
Here, the text error correction network is trained by comparing the output result with the label, and training continues until the output result matches the correct sentence sample. In other words, training learns the conversion relationship between error sentences and correct sentences, so the trained text error correction network achieves higher error correction accuracy.
In one possible implementation, the error sentence sample is obtained as follows:
acquiring a preset candidate character table; the candidate character table comprises a plurality of candidate characters, and the pronunciation-similar characters and shape-similar characters corresponding to each candidate character;
and performing text destruction on the correct sentence sample based on the obtained candidate character table to obtain the error sentence sample.
Here, the text destruction of the correct sentence sample can be realized based on the pronunciation-like characters and the font-like characters corresponding to the candidate characters, so that the correction can be performed on multiple dimensions such as pronunciation dimensions and font dimensions in the text correction stage, and the accuracy of the correction is improved.
In a possible implementation manner, performing text destruction on the correct sentence sample based on the obtained candidate character table to obtain the error sentence sample includes:
segmenting the correct sentence sample to obtain a plurality of word segments;
for a first word segment among the plurality of word segments, searching the candidate character table for a candidate character matching the first word segment, and replacing the first word segment using a pronunciation-similar character or a shape-similar character corresponding to the found candidate character, to obtain a replacement result;
and determining, based on the replacement result, the error sentence sample obtained by text destruction of the correct sentence sample.
Here, after segmentation, text destruction is realized by replacement with pronunciation-similar or shape-similar characters, so that the destroyed error sentence sample retains a certain similarity to the correct sentence sample, and a text error correction network trained on such samples can correct errors well.
In a possible implementation manner, the obtaining of the text content to be corrected includes:
receiving content to be checked uploaded by a client, wherein the type of the content to be checked comprises at least one item of text and images, and the content to be checked comprises text content to be corrected.
In one possible implementation:
in the case that the content to be checked comprises a text, the text content to be corrected comprises characters or character strings in the text; and/or
in the case that the content to be checked comprises an image, the text content to be corrected comprises characters or character strings in text recognized from the image by character recognition.
In a possible implementation, after obtaining the corrected text content, the method further includes:
returning error correction prompt information to the client; and the error correction prompt information is used for indicating the position to be corrected corresponding to the text content to be corrected in the content to be checked.
Here, in the case of determining the text content after error correction, the user may be prompted with the position to be corrected based on the presentation of the error correction prompt information, so that the user can know the specific error correction position conveniently, and correct the position in time.
In a possible implementation manner, the error correction prompt information is further used for providing reference text content corresponding to the erroneous text content in the text content to be corrected, and the method further includes:
and responding to the trigger instruction aiming at the position to be corrected, and displaying the corrected reference text content.
The reference text content after relevant error correction can be displayed, and a user can select the desired error correction text content based on the displayed content without manual input of the user, so that time and labor are saved.
In a possible implementation, the presenting the corrected reference text content includes:
displaying the corrected reference text content at the corrected position corresponding to the position to be corrected by a preset display special effect;
or, replacing the text content to be corrected with the corrected reference text content, and displaying the corrected reference text content at the position to be corrected;
or displaying the text content to be corrected and the corrected reference text content in a split screen mode.
In one possible embodiment, the text comprises a plurality of articles; the method further comprises the following steps:
determining error correction historical information of the client based on error correction prompt information generated for each article in the plurality of articles within a preset time period; the error correction history information comprises at least one of the error correction times in a single article, the total error correction times in a plurality of articles, text contents to be corrected corresponding to different articles generating the same error and text types to which the text contents to be corrected belong;
and determining a performance assessment result aiming at the client according to the error correction historical information.
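The aggregation of error correction history described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the input layout (an article identifier mapped to a list of `(wrong text, corrected text)` prompts) and the function name `summarize_history` are assumptions, and how the summary is turned into a performance assessment result is left open, since the text does not specify it.

```python
from collections import Counter

def summarize_history(prompts_by_article):
    """Summarize error correction prompts collected for one client.

    prompts_by_article: hypothetical mapping of article id to a list of
    (wrong text, corrected text) tuples generated within the time period.
    """
    # Number of corrections in each single article.
    per_article = {article: len(prompts) for article, prompts in prompts_by_article.items()}
    # Total number of corrections over all articles.
    total = sum(per_article.values())
    # Count in how many different articles each error appears.
    articles_per_error = Counter()
    for prompts in prompts_by_article.values():
        for error in set(prompts):
            articles_per_error[error] += 1
    # Errors that recur across different articles.
    repeated = sorted(e for e, n in articles_per_error.items() if n > 1)
    return {"per_article": per_article, "total": total, "repeated_errors": repeated}

history = {
    "article-1": [("盆血", "贫血"), ("天汽", "天气")],
    "article-2": [("盆血", "贫血")],
}
summary = summarize_history(history)
```

A downstream assessment could, for instance, weight repeated errors more heavily than one-off errors, but any such scoring rule would be a design choice outside what the text states.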
In a second aspect, an embodiment of the present disclosure further provides an apparatus for text error correction, where the apparatus includes:
the acquisition module is used for acquiring text contents to be corrected;
the error correction module is used for performing multi-dimensional text error correction including a pronunciation dimension and a glyph dimension on the text content to be corrected based on the trained text error correction network to obtain the corrected text content;
the text error correction network is obtained by training based on generated error sentence samples, and the error sentence samples are obtained by destroying correct sentence samples based on preset character-pronunciation similar characters and character-shape similar characters.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the method of text error correction according to the first aspect and any of its various embodiments.
In a fourth aspect, the disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the method for text error correction according to the first aspect and any of its various embodiments.
For the description of the effect of the text error correction apparatus, the electronic device, and the computer-readable storage medium, reference is made to the description of the text error correction method, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings are incorporated in and form a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows a flowchart of a method for text error correction provided by an embodiment of the present disclosure;
Fig. 2 shows a schematic diagram of an apparatus for text error correction provided by an embodiment of the present disclosure;
Fig. 3 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an association relationship and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the group consisting of A, B and C.
The research finds that text error correction is a process for correcting wrong characters in the text. In the related art, the correction can be performed manually, which consumes a lot of time of related personnel and has low error correction efficiency.
Based on the above research, the present disclosure provides a method, an apparatus, an electronic device, and a storage medium for text error correction, so as to improve error correction efficiency.
To facilitate understanding of the present embodiment, first, a text error correction method disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the text error correction method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the method for text error correction may be implemented by a processor calling computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a text error correction method provided by an embodiment of the present disclosure is shown, where the method includes steps S101 to S102, where:
S101: acquiring text content to be corrected;
S102: performing multi-dimensional text error correction including a pronunciation dimension and a glyph dimension on the text content to be corrected based on the trained text error correction network to obtain the corrected text content; the text error correction network is trained on generated error sentence samples, and each error sentence sample is obtained by destroying a correct sentence sample using preset pronunciation-similar characters and preset shape-similar characters.
In order to facilitate understanding of the text error correction method provided by the embodiments of the present disclosure, first, a brief description is provided below on an application scenario of the method. The text error correction method can be applied to various fields needing text error correction, such as the field of voice Recognition, the field of Optical Character Recognition (OCR), the field of new media, the field of question answering and the like. In view of the wide application of OCR recognition, the following description is given by exemplifying character error correction in OCR recognition.
The manual correction approach in the related art cannot ensure error correction efficiency, which greatly limits the wide application of text error correction technology in various technical fields. To improve error correction efficiency, the embodiment of the present disclosure provides a scheme for multi-dimensional text error correction using a trained text error correction network. Because the network can be trained in advance, error correction efficiency is significantly improved; and because the network can correct errors in multiple dimensions, including the pronunciation dimension and the glyph dimension, error correction accuracy is further improved.
Based on different application scenarios, the text content to be corrected obtained in the embodiment of the present disclosure may also be different. For example, in the field of plain text recognition, the text content to be corrected may be content on a basic article, or may be content on a literary work, for example, content on a certain paragraph on a novel, and may be specifically presented in the form of characters or character strings; for another example, in the field of OCR recognition, the text content to be corrected may be text content that is edited by intelligently recognizing the text content in the image, and specifically may also be presented in the form of characters or character strings.
In a specific application, the text content to be corrected may be obtained from content to be checked that is uploaded by the client. The content to be checked may include text or images. For text, the text content to be corrected may be the text content itself, for example articles and novels; for an image, the text content to be corrected may be text content recognized from the image by character recognition, for example text content determined by text detection and text recognition from an image carrying text information, such as a poster.
Under the condition of acquiring the text content to be corrected, the multi-dimensional text correction can be realized based on the trained text correction network, and further the text content after error correction can be directly obtained.
The text error correction network in the embodiment of the present disclosure may learn the conversion relationship between error sentence samples and correct sentence samples, where an error sentence sample may be obtained by destroying a correct sentence sample based on pronunciation-similar characters and shape-similar characters. Once this conversion relationship has been learned, the text content to be corrected can be corrected.
It should be noted that, because the text destruction is based on the destruction of two layers of the pronunciation-like character and the font-like character, the text content to be corrected can be corrected in at least two dimensions, such as pronunciation dimension and font dimension, i.e., regardless of whether a pronunciation error or a font error exists in the text content to be corrected, the embodiment of the present disclosure can perform efficient and accurate correction.
OCR recognition is again taken as an example. For example, when the text content to be corrected is "snake eye", a pronunciation error is present, and it can be corrected based on the pronunciation similarity between "eye" and "mirror"; for another example, when the text content to be corrected is "iron deficiency type pelvic blood", a glyph error is present, and it can be corrected based on the glyph similarity between "pot" and "poor".
The following describes in detail a specific training process of the text error correction network, which mainly includes the following steps:
Step one: acquire a correct sentence sample and perform text destruction on it to obtain an error sentence sample; the error sentence sample differs from the correct sentence sample in at least one character;
Step two: take the error sentence sample as input data of the text error correction network to be trained to obtain an output result, take the correct sentence sample corresponding to the error sentence sample as the label of the error sentence sample, and perform at least one round of training on the text error correction network to obtain the trained text error correction network.
Here, correct sentence samples and incorrect sentence samples need to be obtained in advance, and then the incorrect sentence samples and the corresponding correct sentence samples can be respectively used as input items and output comparison items of a text error correction network to be trained to realize training of the text error correction network, wherein the correct sentence samples are used as the output comparison items and can be used as supervision information of the corresponding incorrect sentence samples to supervise network training.
In the process of training the text error correction network, the error sentence sample can be input into the text error correction network to be trained, then an output result obtained by network output is compared with a correct sentence sample (training label) corresponding to the error sentence sample, if the comparison result is inconsistent, the network parameter value of the text error correction network can be adjusted, and the next round of training can be carried out based on the adjusted text error correction network until the comparison result is highly matched, so that the trained text error correction network is obtained.
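The compare-and-adjust loop described above can be illustrated with a deliberately simplified stand-in. The sketch below does not use a neural network: `ToyCorrector` is a hypothetical lookup model keyed on (previous character, current character), chosen only so that the supervised loop (feed the error sentence, compare the output with the label, adjust on mismatch, repeat until the outputs match) is runnable with the standard library. The sample pairs are illustrative assumptions.

```python
class ToyCorrector:
    """Stand-in for the network: a context-keyed character rewrite table."""

    def __init__(self):
        self.rules = {}  # (previous input char, input char) -> corrected char

    def predict(self, sentence):
        out, prev = [], ""
        for ch in sentence:
            out.append(self.rules.get((prev, ch), ch))
            prev = ch
        return "".join(out)

    def update(self, wrong, label):
        """Adjust the 'parameters' wherever the output disagrees with the label."""
        prev = ""
        for w, c in zip(wrong, label):
            if self.rules.get((prev, w), w) != c:
                self.rules[(prev, w)] = c
            prev = w

def train(model, pairs, max_rounds=10):
    """Train until every output matches its label; return the rounds used."""
    for round_no in range(1, max_rounds + 1):
        mismatches = 0
        for wrong, label in pairs:
            if model.predict(wrong) != label:  # compare output with label
                mismatches += 1
                model.update(wrong, label)     # adjust, then continue
        if mismatches == 0:
            return round_no
    return max_rounds

# (error sentence sample, correct sentence sample used as label)
pairs = [("缺铁性盆血", "缺铁性贫血"), ("天汽预报", "天气预报")]
model = ToyCorrector()
rounds = train(model, pairs)
```

A real text error correction network would replace the lookup table with learned parameters and a loss function, and would generalize to unseen contexts, which this toy cannot do.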
The above-mentioned incorrect sentence sample can be obtained based on a text destruction operation on the correct sentence sample, and can be specifically realized by the following steps:
Step one: acquire a preset candidate character table; the candidate character table comprises a plurality of candidate characters, and the pronunciation-similar characters and shape-similar characters corresponding to each candidate character;
Step two: perform text destruction on the correct sentence sample based on the acquired candidate character table to obtain an error sentence sample.
The candidate character table here may be set in advance. The candidate character table may store, as prior knowledge, a candidate character and a phonetic similar character and a font similar character corresponding to the candidate character. In this way, for a correct sentence sample to be destroyed, a corresponding incorrect sentence sample can be determined based on the above-mentioned a priori knowledge.
The candidate character table stores candidate characters as its basic storage units. The candidate characters may be taken from an existing character set, or may be high-frequency characters, common characters, error-prone and easily confused characters, and the like collected for different application scenarios. For example, for the speech recognition field, the candidate characters may include characters such as a word, a place, and the like, and for the driving field, the candidate characters may be characters such as a vehicle, and the like; further details are omitted here.
The embodiment of the disclosure can determine which characters are easy to be confused with the candidate character (corresponding to the character-shape similar character) and which characters have the same or similar pronunciation as the candidate character (corresponding to the character-sound similar character) aiming at the candidate character in the candidate character table, and further establish the corresponding relation between the character-sound similar character and the character-shape similar character and the corresponding candidate character. In this way, when text destruction is required for a correct sentence sample, the text destruction can be performed based on the correspondence.
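One possible in-memory layout for such a candidate character table is sketched below. The class name `CandidateEntry`, the helper names, and the sample entries are all assumptions made for illustration; the text only states that each candidate character is associated with its pronunciation-similar and shape-similar characters.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CandidateEntry:
    """One row of the candidate character table (hypothetical layout)."""
    char: str
    sound_similar: List[str] = field(default_factory=list)  # same or similar pronunciation
    shape_similar: List[str] = field(default_factory=list)  # easily confused glyphs

def build_table(entries) -> Dict[str, CandidateEntry]:
    """Index the rows by candidate character for direct lookup."""
    return {entry.char: entry for entry in entries}

# Illustrative entries only; a real table would be far larger.
CANDIDATE_TABLE = build_table([
    CandidateEntry("镜", sound_similar=["睛", "静"], shape_similar=["境"]),
    CandidateEntry("贫", sound_similar=["频"], shape_similar=["盆", "贪"]),
])

def similar_chars(table, char):
    """Return all replacement candidates for `char`, or [] if it is not in the table."""
    entry = table.get(char)
    if entry is None:
        return []
    return entry.sound_similar + entry.shape_similar
```

Keeping the two similarity lists separate makes it possible to choose, per sample, whether a pronunciation error or a glyph error is injected.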
In order to satisfy the requirement of error correction for the text content to be corrected, the correct sentence sample may be a short sentence with context, and in general, the short sentence may contain a plurality of characters. In the embodiment of the present disclosure, under the condition that text destruction is performed based on the candidate characters in the candidate character table, segmentation may be performed first, and then replacement may be performed, so as to implement targeted text destruction operation, which is specifically implemented by the following steps:
Step one: segment the correct sentence sample to obtain a plurality of word segments;
Step two: for a first word segment among the plurality of word segments, search the candidate character table for a candidate character matching the first word segment, and replace the first word segment using a pronunciation-similar character or a shape-similar character corresponding to the found candidate character, to obtain a replacement result;
Step three: determine, based on the replacement result, the error sentence sample obtained by text destruction of the correct sentence sample.
The first word segment here may be one or more arbitrarily chosen word segments, or one or more specific word segments. A replacement result can be obtained for each of the one or more word segments, so that the resulting error sentence samples are more diverse, which facilitates the subsequent error correction learning of the text error correction network.
It should be noted that, when specific word segments are selected as the first word segment for text destruction, they can be selected with easily confused, error-prone characters in mind, so that the generated error sentence samples are more targeted and better adapted to a specific scenario.
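The segment-then-replace procedure can be sketched as follows, assuming the sentence has already been segmented. The table contents, the function names, and the policy of destroying the first segment that has a table match are illustrative assumptions; the text allows the first word segment to be chosen arbitrarily or specifically.

```python
import random

# Hypothetical candidate table: character -> (pronunciation-similar, shape-similar).
TABLE = {
    "镜": (["睛"], ["境"]),
    "贫": (["频"], ["盆"]),
}

def destroy_segment(segment, table, rng):
    """Replace one table character in `segment` with a similar character.

    Returns the segment unchanged if none of its characters is in the table.
    """
    positions = [i for i, ch in enumerate(segment) if ch in table]
    if not positions:
        return segment
    i = rng.choice(positions)
    sound, shape = table[segment[i]]
    replacement = rng.choice(sound + shape)
    return segment[:i] + replacement + segment[i + 1:]

def destroy_sentence(segments, table, rng):
    """Destroy the first segment that has a table match and join the result."""
    out = list(segments)
    for idx, seg in enumerate(out):
        damaged = destroy_segment(seg, table, rng)
        if damaged != seg:
            out[idx] = damaged
            break
    return "".join(out)

rng = random.Random(0)
correct_segments = ["缺铁性", "贫血"]
wrong_sentence = destroy_sentence(correct_segments, TABLE, rng)
```

Because only one character is replaced, the error sentence sample stays the same length as the correct sentence sample and differs from it in exactly one position.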
The segmentation processing in the embodiment of the present disclosure may be implemented based on a dictionary, statistics, a neural network, or similar methods. For example, dictionary-based segmentation can be implemented with greedy matching: in practice, the dictionary is searched starting from the first character of the sentence, the longest dictionary word beginning with that character is found, and this yields the first segment of the sentence. For another example, statistics-based segmentation can be implemented from a global perspective: in practice, the most reasonable combination is selected from the various possible segment combinations, which can be regarded as finding the maximum-probability path in the segmentation graph. For another example, the segmentation may be implemented based on a Long Short-Term Memory (LSTM) network, which is a recurrent neural network; the specific method is not described here.
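The two non-neural strategies mentioned above can be sketched in a few lines. `forward_max_match` is a greedy longest-match segmenter, and `max_prob_segment` uses dynamic programming to find the split with the highest product of word probabilities. The dictionary, the probabilities, the `max_len` limit, and the `unknown_prob` fallback for out-of-vocabulary single characters are assumptions made for this example.

```python
import math

def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest dictionary word."""
    segments, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                segments.append(piece)
                i += size
                break
    return segments

def max_prob_segment(sentence, word_prob, max_len=4, unknown_prob=1e-8):
    """Statistical segmentation: choose the split with the maximum total log-probability."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of sentence[:i]
    back = [0] * (n + 1)          # back[i]: start of the last word in that best split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            p = word_prob.get(word)
            if p is None:
                if len(word) > 1:
                    continue      # unknown multi-character strings are not words
                p = unknown_prob  # unknown single characters get a small probability
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    segments, end = [], n
    while end > 0:
        segments.append(sentence[back[end]:end])
        end = back[end]
    return segments[::-1]

WORD_PROB = {"研究": 0.02, "研究生": 0.01, "生命": 0.02, "起源": 0.01}
greedy = forward_max_match("研究生命起源", set(WORD_PROB))
statistical = max_prob_segment("研究生命起源", WORD_PROB)
```

On this sentence the greedy matcher commits early to the longest word "研究生" and leaves "命" stranded, while the maximum-probability path recovers "研究 / 生命 / 起源", which illustrates why a global criterion can be preferable.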
Whichever of the above methods is used for segmentation, during text destruction a first segmented word composed of multiple characters need not be destroyed as a whole; instead, one of its characters can be selected for destruction, so as to better match the error correction requirement.
In addition, to better meet the error correction requirement, text destruction can be performed on a plurality of first segmented words in practice. If several consecutive segmented words were destroyed, subsequent error correction would become harder and less accurate. To reduce this adverse effect, the plurality of first segmented words may be selected at intervals. The intervals between the first segmented words may be the same or different, and are not limited here. This avoids the situation in which several consecutive segmented words are all wrong, so the error samples remain diverse while the accuracy of subsequent error correction is preserved.
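The destruction procedure described above (look up a character in the candidate character table, replace only one character of a selected segmented word, and keep destroyed words apart) can be sketched as below. This is a hedged illustration only: the table entries, the function name `destroy_sentence`, and the `min_gap` and `rate` parameters are assumptions made for the example and do not come from the disclosure.

```python
from __future__ import annotations

import random

# Hypothetical candidate character table (illustrative entries only):
# candidate character -> (pronunciation-similar characters, shape-similar characters)
CANDIDATE_TABLE = {
    "在": (["再"], ["存"]),
    "己": (["几"], ["已"]),
    "做": (["作"], ["故"]),
}


def destroy_sentence(words, table, rng, min_gap=1, rate=0.5):
    """Return an error sentence built from an already segmented correct sentence."""
    damaged = list(words)
    last_hit = None  # index of the most recently destroyed segmented word
    for idx, word in enumerate(words):
        # Leave at least `min_gap` intact words between two destroyed words.
        if last_hit is not None and idx - last_hit <= min_gap:
            continue
        # Only characters listed in the candidate table can be replaced.
        positions = [p for p, ch in enumerate(word) if ch in table]
        if not positions or rng.random() >= rate:
            continue
        # Destroy a single character rather than the whole word.
        pos = rng.choice(positions)
        similar_sound, similar_shape = table[word[pos]]
        replacement = rng.choice(similar_sound + similar_shape)
        damaged[idx] = word[:pos] + replacement + word[pos + 1:]
        last_hit = idx
    return "".join(damaged)


if __name__ == "__main__":
    correct_words = ["我", "在", "做", "自己", "的", "事"]
    error_sentence = destroy_sentence(correct_words, CANDIDATE_TABLE, random.Random(0))
    # One training pair: (input error sentence, label correct sentence).
    print(error_sentence, "".join(correct_words))
```

With a `rate` below 1 the sketch may leave a sentence unchanged; a caller that needs at least one differing character, as the training pairs here do, would have to discard such results or retry.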
The text error correction method provided by the embodiments of the present disclosure can be applied to the review of content to be published; for example, a managing editor can review articles written by an editor. It can also be applied wherever related content needs to be checked for errors; for example, the text of an article can be proofread while the author uploads it. The method can further be applied to any other stage in which text error correction is needed, and no specific limitation is made here.
Here, after the corrected text content is obtained, error correction prompt information may be returned to the client to indicate the position to be corrected, that is, the position of the text content to be corrected within the content to be checked. The user can confirm that a text error has occurred based on the prompted position. In addition, in response to a trigger instruction for the position to be corrected, the corrected reference text content can be displayed so that the user can select among the candidate corrections and actively make the modification.
The corrected reference text content can be displayed in various ways, and the display can be combined with special effects.
Specifically, the corrected reference text content may be displayed with a preset display special effect at the corrected position corresponding to the position to be corrected. For example, a strikethrough may be added to the original erroneous text (i.e., the text content to be corrected) and the correct text (i.e., the corrected reference text content) displayed nearby, with the correct text shown in a special effect such as a pop-up window or a bubble box. Alternatively, the corrected reference text content can replace the text content to be corrected and be displayed at the position to be corrected; for example, the correct text that replaces the original erroneous text can be highlighted. Other display modes may also be adopted in the embodiments of the present disclosure, and no specific limitation is made here.
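As a rough illustration of the prompt information and the strikethrough-style display discussed above, the sketch below compares the original text with the corrected text and marks each differing span. It assumes that correction only substitutes characters, so both strings have the same length; the function names, the dictionary keys, and the `[-wrong-]{+right+}` plain-text markup are hypothetical choices for this example, not part of the disclosure.

```python
from __future__ import annotations


def build_prompts(original: str, corrected: str) -> list[dict]:
    """Locate the positions to be corrected by comparing the two strings."""
    if len(original) != len(corrected):
        raise ValueError("this sketch assumes substitution-only correction")
    prompts = []
    start = None
    for i in range(len(original) + 1):
        differs = i < len(original) and original[i] != corrected[i]
        if differs and start is None:
            start = i  # a new differing span begins
        elif not differs and start is not None:
            prompts.append({
                "start": start,
                "end": i,
                "to_correct": original[start:i],
                "reference": corrected[start:i],
            })
            start = None
    return prompts


def render_inline(original: str, prompts: list[dict]) -> str:
    """Plain-text stand-in for a strikethrough plus a nearby correct text."""
    pieces, cursor = [], 0
    for p in prompts:
        pieces.append(original[cursor:p["start"]])
        pieces.append(f"[-{p['to_correct']}-]{{+{p['reference']}+}}")
        cursor = p["end"]
    pieces.append(original[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    found = build_prompts("我再做自已的事", "我在做自己的事")
    print(found)
    print(render_inline("我再做自已的事", found))
```

A real client would map the `start` and `end` offsets to its own rendering, for example a pop-up or bubble box, rather than to inline markup.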
Besides helping a user correct text errors, the text error correction method provided by the embodiments of the present disclosure can compile error statistics for an individual person or an individual article, so as to support performance assessment.
The resulting error correction history information may include the number of error corrections in a single article, the total number of error corrections across multiple articles, the text content to be corrected that produces the same error in different articles together with its text type, or other statistical information. For example, the average number of error corrections can be computed over a time period, so that the relevant personnel can be evaluated quantitatively.
For example, for an editor, a larger number of error corrections in a single article edited by that editor, or a larger total number across multiple articles, indicates to some extent a less careful working attitude; and when the text content to be corrected that produces the same error in different articles belongs to the same text type, this indicates to some extent that the editor is not sufficiently familiar with the field to which the articles belong. With timely knowledge of these statistics, more targeted management measures can be taken for the editor.
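The statistics mentioned above can be aggregated along the following lines. This is a minimal sketch under assumed inputs: each record is taken to be a tuple of article identifier, text content to be corrected, and text type, and the function name `summarize_history` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_history(records):
    """Aggregate error correction prompts gathered over a preset time period.

    Each record is assumed to be (article_id, text_to_correct, text_type).
    """
    per_article = Counter()
    articles_by_error = defaultdict(set)
    type_by_error = {}
    for article_id, text_to_correct, text_type in records:
        per_article[article_id] += 1
        articles_by_error[text_to_correct].add(article_id)
        type_by_error[text_to_correct] = text_type
    # Keep only errors that appear in more than one article.
    repeated = {
        text: {"articles": sorted(ids), "text_type": type_by_error[text]}
        for text, ids in articles_by_error.items()
        if len(ids) > 1
    }
    return {
        "per_article": dict(per_article),
        "total": sum(per_article.values()),
        "repeated_errors": repeated,
    }


if __name__ == "__main__":
    sample = [
        ("a1", "帐户", "finance"),
        ("a1", "做为", "general"),
        ("a2", "帐户", "finance"),
    ]
    print(summarize_history(sample))
```

How these numbers are turned into an assessment result is left open here, as it is in the text.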
It will be understood by those skilled in the art that, in the above method, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a text error correction device corresponding to the text error correction method, and as the principle of the device in the embodiment of the present disclosure for solving the problem is similar to the text error correction method in the embodiment of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 2, a schematic diagram of an apparatus for text error correction provided in an embodiment of the present disclosure is shown, the apparatus including: an acquisition module 201 and an error correction module 202; wherein,
the acquisition module 201 is configured to acquire text content to be corrected;
the error correction module 202 is configured to perform, based on the trained text error correction network, multi-dimensional text error correction covering a character-pronunciation dimension and a character-shape dimension on the text content to be corrected, to obtain corrected text content;
the text error correction network is trained on generated error sentence samples, and the error sentence samples are obtained by destroying correct sentence samples based on preset pronunciation-similar characters and shape-similar characters.
By adopting the text error correction device, under the condition of acquiring the text content to be corrected, the multi-dimensional text error correction can be carried out on the text content to be corrected based on the trained text error correction network so as to obtain the text content after error correction. Because the text error correction network is obtained by training the error sentence sample obtained by destroying the correct sentence sample based on the preset character-pronunciation similar characters and the preset character-shape similar characters, the text error correction network can learn the conversion relation between the error sentence and the correct sentence, further can guide the quick error correction of the text content to be corrected, and has higher error correction efficiency.
In one possible embodiment, the apparatus includes a training module 203;
a training module 203, configured to train the text error correction network according to the following steps:
acquiring a correct sentence sample and performing text destruction on the correct sentence sample to obtain an error sentence sample; at least one character differs between the error sentence sample and the correct sentence sample;
and taking the error sentence sample as input data of the text error correction network to be trained to obtain an output result, taking a correct sentence sample corresponding to the error sentence sample as a label of the error sentence sample, and performing at least one round of training on the text error correction network to be trained to obtain the trained text error correction network.
In a possible implementation, the training module 203 is configured to obtain the error sentence sample according to the following steps:
acquiring a preset candidate character table; the candidate character table comprises a plurality of candidate characters, and the pronunciation-similar characters and shape-similar characters corresponding to each candidate character;
and performing text destruction on the correct sentence sample based on the acquired candidate character table to obtain the error sentence sample.
In a possible implementation, the training module 203 is configured to perform text destruction on the correct sentence sample based on the acquired candidate character table to obtain the error sentence sample according to the following steps:
performing segmentation processing on the correct sentence sample to obtain a plurality of segmented words;
for a first segmented word among the plurality of segmented words, searching the candidate character table for a candidate character matching the first segmented word, and replacing the first segmented word using a pronunciation-similar character or a shape-similar character corresponding to the found candidate character, to obtain a replacement result;
and determining, based on the replacement result, the error sentence sample obtained by performing text destruction on the correct sentence sample.
In a possible implementation, the acquisition module 201 is configured to acquire the text content to be corrected according to the following steps:
receiving the content to be checked uploaded by the client, wherein the type of the content to be checked comprises at least one of text and image, and the content to be checked comprises the text content to be corrected.
In one possible implementation, in the case that the content to be checked includes text, the text content to be corrected includes characters or character strings in the text; and/or,
in the case that the content to be checked includes an image, the text content to be corrected includes characters or character strings in text recognized from the image by a character recognition method.
In a possible embodiment, the above apparatus further comprises:
the prompt module 204 is configured to return error correction prompt information to the client after obtaining the text content after error correction; the error correction prompt information is used for indicating the position to be corrected corresponding to the text content to be corrected in the content to be checked.
In a possible implementation manner, the error correction prompt information is further used for providing reference text content corresponding to the erroneous text content in the text content to be corrected, and the apparatus further includes:
and the presentation module 205 is configured to respond to a trigger instruction for a position to be corrected, and present the corrected reference text content.
In a possible implementation, the presentation module 205 is configured to present the corrected reference text content according to the following steps:
displaying the corrected reference text content at the corrected position corresponding to the position to be corrected by a preset display special effect;
or replacing the text content to be corrected with the corrected reference text content, and displaying the corrected reference text content at the position to be corrected;
or displaying the text content to be corrected and the corrected reference text content in a split screen mode.
In one possible embodiment, the text includes a plurality of articles; the above apparatus further includes:
the assessment module 206 is configured to determine error correction history information of the client based on error correction prompt information generated for each article in the plurality of articles within a preset time period; the error correction history information comprises at least one of the error correction times in a single article, the total error correction times in a plurality of articles, text contents to be corrected corresponding to different articles generating the same error and text types to which the text contents to be corrected belong; and determining a performance assessment result aiming at the client according to the error correction historical information.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
An embodiment of the present disclosure further provides an electronic device. Fig. 3 is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure; the electronic device includes a processor 301, a memory 302, and a bus 303. The memory 302 stores machine-readable instructions executable by the processor 301 (for example, execution instructions corresponding to the acquisition module 201 and the error correction module 202 in the apparatus of fig. 2). When the electronic device is running, the processor 301 and the memory 302 communicate via the bus 303, and when the machine-readable instructions are executed by the processor 301, the following processing is performed:
acquiring text content to be corrected;
performing, based on the trained text error correction network, multi-dimensional text error correction covering a character-pronunciation dimension and a character-shape dimension on the text content to be corrected, to obtain the corrected text content;
the text error correction network is trained on generated error sentence samples, and the error sentence samples are obtained by destroying correct sentence samples based on preset pronunciation-similar characters and shape-similar characters.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method for text error correction described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the text error correction method in the foregoing method embodiments, which may be referred to specifically as the foregoing method embodiments, and are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope of the present disclosure, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method of text correction, the method comprising:
acquiring text content to be corrected;
performing, based on a trained text error correction network, multi-dimensional text error correction covering a character-pronunciation dimension and a character-shape dimension on the text content to be corrected, to obtain corrected text content;
wherein the text error correction network is trained on generated error sentence samples, and the error sentence samples are obtained by destroying correct sentence samples based on preset pronunciation-similar characters and shape-similar characters.
2. The method of claim 1, wherein the text correction network is trained according to the following steps:
acquiring a correct sentence sample and an error sentence sample obtained by performing text destruction on the correct sentence sample; at least one character differs between the error sentence sample and the correct sentence sample;
and taking the error sentence sample as input data of the text error correction network to be trained to obtain an output result, taking a correct sentence sample corresponding to the error sentence sample as a label of the error sentence sample, and performing at least one round of training on the text error correction network to be trained to obtain the trained text error correction network.
3. The method according to claim 1 or 2, characterized in that the error sentence sample is obtained as follows:
acquiring a preset candidate character table; the candidate character table comprises a plurality of candidate characters, and the pronunciation-similar characters and shape-similar characters corresponding to each candidate character;
and performing text destruction on the correct sentence sample based on the acquired candidate character table to obtain the error sentence sample.
4. The method of claim 3, wherein performing text destruction on the correct sentence sample based on the acquired candidate character table to obtain the error sentence sample comprises:
performing segmentation processing on the correct sentence sample to obtain a plurality of segmented words;
for a first segmented word among the plurality of segmented words, searching the candidate character table for a candidate character matching the first segmented word, and replacing the first segmented word using a pronunciation-similar character or a shape-similar character corresponding to the found candidate character, to obtain a replacement result;
and determining, based on the replacement result, the error sentence sample obtained by performing text destruction on the correct sentence sample.
5. The method according to any one of claims 1 to 4, wherein the obtaining of the text content to be corrected comprises:
receiving content to be checked uploaded by a client, wherein the type of the content to be checked comprises at least one of text and image, and the content to be checked comprises the text content to be corrected.
6. The method of claim 5, wherein
in the case that the content to be checked comprises text, the text content to be corrected comprises characters or character strings in the text; and/or,
in the case that the content to be checked comprises an image, the text content to be corrected comprises characters or character strings in text recognized from the image by a character recognition method.
7. The method according to claim 5 or 6, wherein after obtaining the corrected text content, the method further comprises:
returning error correction prompt information to the client; and the error correction prompt information is used for indicating the position to be corrected corresponding to the text content to be corrected in the content to be checked.
8. The method according to claim 7, wherein the error correction prompt information is further used for providing reference text content corresponding to the erroneous text content in the text content to be corrected, and the method further comprises:
and responding to the trigger instruction aiming at the position to be corrected, and displaying the corrected reference text content.
9. The method of claim 8, wherein presenting the corrected reference text content comprises:
displaying the corrected reference text content at the corrected position corresponding to the position to be corrected by a preset display special effect;
or, replacing the text content to be corrected with the corrected reference text content, and displaying the corrected reference text content at the position to be corrected;
or displaying the text content to be corrected and the corrected reference text content in a split screen mode.
10. The method of any of claims 7-9, wherein the text comprises a plurality of articles; the method further comprises the following steps:
determining error correction historical information of the client based on error correction prompt information generated for each article in the plurality of articles within a preset time period; the error correction history information comprises at least one of the error correction times in a single article, the total error correction times in a plurality of articles, text contents to be corrected corresponding to different articles generating the same error and text types to which the text contents to be corrected belong;
and determining a performance assessment result aiming at the client according to the error correction historical information.
11. An apparatus for correcting text, the apparatus comprising:
the acquisition module is used for acquiring text contents to be corrected;
the error correction module is used for performing, based on a trained text error correction network, multi-dimensional text error correction covering a character-pronunciation dimension and a character-shape dimension on the text content to be corrected, to obtain the corrected text content;
wherein the text error correction network is trained on generated error sentence samples, and the error sentence samples are obtained by destroying correct sentence samples based on preset pronunciation-similar characters and shape-similar characters.
12. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the method of text correction according to any of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for text correction according to one of claims 1 to 10.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110711749.7A CN113343678A (en) 2021-06-25 2021-06-25 Text error correction method and device, electronic equipment and storage medium
PCT/CN2021/134638 WO2022267353A1 (en) 2021-06-25 2021-11-30 Text error correction method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110711749.7A CN113343678A (en) 2021-06-25 2021-06-25 Text error correction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113343678A (en) 2021-09-03

Family

ID=77478919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110711749.7A Pending CN113343678A (en) 2021-06-25 2021-06-25 Text error correction method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113343678A (en)
WO (1) WO2022267353A1 (en)

Also Published As

Publication number Publication date
WO2022267353A1 (en) 2022-12-29

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40049357; Country of ref document: HK)
RJ01 Rejection of invention patent application after publication (Application publication date: 20210903)