WO2022267353A1 - Text error correction method, apparatus, electronic device, and storage medium


Info

Publication number
WO2022267353A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
error correction
corrected
content
text content
Prior art date
Application number
PCT/CN2021/134638
Other languages
English (en)
French (fr)
Inventor
詹明捷
梁鼎
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Publication of WO2022267353A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present disclosure relates to the technical field of information processing, and in particular, to a text error correction method, device, electronic equipment, and storage medium.
  • OCR recognition technology can recognize text content from images.
  • typos may appear in the recognized text content.
  • Text error correction is the process of correcting typos in the text. In related technologies, corrections can be made manually, which will consume a lot of time for relevant personnel, and the error correction efficiency is low.
  • Embodiments of the present disclosure at least provide a text error correction method, device, electronic equipment, and storage medium, so as to improve error correction efficiency.
  • An embodiment of the present disclosure provides a text error correction method, the method comprising: obtaining text content to be corrected; and performing, based on a trained text error correction network, multi-dimensional text error correction including a phonetic dimension and a font dimension on the text content to be corrected, to obtain error-corrected text content; wherein the text error correction network is trained on wrong sentence samples, and the wrong sentence samples are obtained by destroying correct sentence samples based on preset phonetically similar characters and font-similar characters.
  • With the above method, multi-dimensional text error correction can be performed on the text content to be corrected based on the trained text error correction network to obtain the corrected text content. Because the text error correction network is trained on wrong sentence samples, and the wrong sentence samples are obtained by destroying correct sentence samples based on preset phonetically similar characters and font-similar characters, the network can learn the conversion relationship between wrong sentences and correct sentences. This relationship then guides rapid error correction of the text content to be corrected, so error correction efficiency is high.
  • The text error correction network is trained according to the following steps: obtaining correct sentence samples, and performing text destruction on the correct sentence samples to obtain wrong sentence samples, where there is at least one different character between each wrong sentence sample and its correct sentence sample; and performing at least one round of training on the text error correction network by using the wrong sentence sample as input data of the network to be trained and using the correct sentence sample corresponding to the wrong sentence sample as the label of the wrong sentence sample.
  • The text error correction network is trained by comparing the label with the output result until the output result points to the correct sentence sample, at which point the training purpose is achieved; that is, the conversion relationship between wrong sentences and correct sentences is learned through training, and the trained text error correction network also has high error correction accuracy.
  • The wrong sentence sample is obtained according to the following steps: obtaining a preset candidate character table, where the candidate character table includes a plurality of candidate characters and, for each candidate character, the corresponding phonetically similar characters and font-similar characters; and performing text destruction on the correct sentence sample based on the candidate character table to obtain the wrong sentence sample.
  • the text destruction of the correct sentence sample can be realized based on the phonetic similar characters and font similar characters corresponding to the candidate characters, so that in the text error correction stage, errors can be corrected in multiple dimensions such as the phonetic dimension and the font dimension, and the accuracy of error correction can be improved.
  • Performing text destruction on the correct sentence sample based on the candidate character table to obtain the wrong sentence sample includes: performing segmentation processing on the correct sentence sample to obtain multiple word segments; for a first word segment among the multiple word segments, looking up the candidate character that matches the first word segment in the candidate character table, and replacing the first word segment with a phonetically similar character or font-similar character of the found candidate character to obtain a replacement result; and determining, based on the replacement result, the wrong sentence sample corresponding to the correct sentence sample.
  • Acquiring the text content to be corrected includes: receiving verification content to be corrected uploaded by a client, where the type of the verification content includes at least one of text and image, and the verification content to be corrected includes the text content to be corrected.
  • When the verification content includes text, the text content to be corrected includes characters or character strings in the text; and/or, when the verification content includes an image, the text content to be corrected includes characters or character strings in the text recognized from the image by means of character recognition.
  • After the error-corrected text content is obtained, the method further includes: returning error correction prompt information to the client, where the error correction prompt information is used to indicate the position to be corrected corresponding to the text content to be corrected.
  • the user may be prompted with the location to be corrected, so that the user can know the specific error-corrected location for timely correction.
  • The error correction prompt information is also used to provide reference text content corresponding to the erroneous text content in the text content to be corrected, and the method further includes: displaying the reference text content in response to a trigger command for the position to be corrected.
  • Relevant reference text content can be displayed, and the user can select the desired reference text content based on the displayed content, without manual input by the user, saving time and effort.
  • Displaying the reference text content includes: displaying the reference text content at the position to be corrected with a preset display effect; or replacing the text content at the position to be corrected with the reference text content, and displaying the reference text content at the position to be corrected; or displaying the text content to be corrected and the reference text content in split screens.
  • The text includes multiple articles, and the method further includes: determining error correction history information of the client based on the error correction prompt information generated for each of the multiple articles and returned to the client within a preset time period, where the error correction history information includes at least one of: the number of error corrections in a single piece of verification content, the total number of error corrections across multiple pieces of verification content, the text content to be corrected that corresponds to the same error, and the article type of the verification content to which the text content to be corrected corresponding to the same error belongs; and determining a performance appraisal result for the client according to the error correction history information.
  • An embodiment of the present disclosure also provides a text error correction device, the device including: an acquisition module, used to acquire text content to be corrected; and an error correction module, used to perform, based on a trained text error correction network, multi-dimensional text error correction including a phonetic dimension and a font dimension on the text content to be corrected, to obtain error-corrected text content; wherein the text error correction network is trained on wrong sentence samples, and the wrong sentence samples are obtained by destroying correct sentence samples based on preset phonetically similar characters and font-similar characters.
  • an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory, and a bus, the memory stores machine-readable instructions executable by the processor, and when the electronic device is running, the The processor communicates with the memory through a bus, and when the machine-readable instructions are executed by the processor, the steps of the text error correction method described in any one of the first aspect and its various implementation modes are executed.
  • The embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the text error correction method described in any one of the first aspect and its various implementations are executed.
  • the embodiments of the present disclosure further provide a computer program, when the program is executed by a processor, it executes the steps of the text error correction method described in any one of the first aspect and its various implementation modes.
  • FIG. 1 shows a flow chart of a text error correction method provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of a text error correction device provided by an embodiment of the present disclosure
  • Fig. 3 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
  • the present disclosure provides a text error correction method, device, electronic equipment and storage medium to improve error correction efficiency.
  • The text error correction method provided by the embodiments of the present disclosure is generally executed by a computer device with certain computing capability. The computer device includes, for example, a terminal device, a server, or another processing device; the terminal device can be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • the text error correction method may be realized by calling a computer-readable instruction stored in a memory by a processor.
  • FIG. 1 is a flowchart of a text error correction method provided by an embodiment of the present disclosure. The method includes steps S101 to S102, wherein:
  • S101: Obtain the text content to be corrected.
  • S102: Based on the trained text error correction network, perform multi-dimensional text error correction including the phonetic dimension and the font dimension on the text content to be corrected, and obtain the error-corrected text content; wherein the text error correction network is trained on wrong sentence samples, and the wrong sentence samples are obtained by destroying correct sentence samples based on preset phonetically similar characters and font-similar characters.
  • the above method for text error correction can be applied to various fields that require text error correction, such as the field of speech recognition, the field of Optical Character Recognition (OCR), the field of new media, and the field of question answering.
  • the text error correction in OCR recognition will be used as an example to illustrate.
  • an embodiment of the present disclosure provides a solution for multi-dimensional text error correction by using a trained text error correction network.
  • Because the text error correction network can be trained in advance, error correction efficiency is significantly improved.
  • the above-mentioned text error correction network can perform error correction from multiple dimensions including the phonetic dimension and font dimension, which further improves the accuracy of error correction.
  • The text content to be corrected acquired in the embodiments of the present disclosure may differ from one application field to another.
  • In the field of plain text recognition, the text content to be corrected can be the content of an ordinary article or of a literary work, for example a paragraph of a novel; specifically, it can be presented in the form of characters or character strings.
  • In the field of OCR, the text content to be corrected can be editable text content obtained by intelligently recognizing the text in an image; it can likewise be presented in the form of characters or character strings.
  • The above text content to be corrected may be obtained from verification content to be corrected that is uploaded by a client and received.
  • the verification content to be corrected may include text or images.
  • the content to be corrected corresponding to the text can be the content of the text, for example, it can be content such as articles and novels;
  • the content to be corrected corresponding to the image can be the text content recognized from the image by means of character recognition , for example, may be text content determined from images with text information such as posters by means of text detection and text recognition.
  • multi-dimensional text error correction can be realized based on the trained text error correction network, and then the corrected text content can be obtained directly.
  • the purpose of text error correction network training in the embodiments of the present disclosure may be to learn the conversion relationship between wrong sentence samples and correct sentence samples.
  • the erroneous sentence samples can be obtained by destroying the correct sentence samples based on the phonetically similar characters and the font-like characters. In this way, when the above conversion relationship is learned, the text error correction network can be used to correct the text content to be corrected.
  • The specific training process of the text error correction network is described in detail next. It mainly includes the following steps: obtaining correct sentence samples, and destroying the correct sentence samples to obtain wrong sentence samples, where there is at least one different character between each wrong sentence sample and its correct sentence sample; and performing at least one round of training on the text error correction network by using the wrong sentence sample as input data of the network to be trained and using the correct sentence sample corresponding to the wrong sentence sample as the label of the wrong sentence sample, to obtain the trained text error correction network.
  • the wrong sentence samples and the corresponding correct sentence samples can be respectively used as input items and output comparison items of the text error correction network to be trained to realize the training of the text error correction network.
  • the correct sentence sample is used as the output comparison item, which can be used as the supervision information of the corresponding wrong sentence sample to supervise the network training.
  • The wrong sentence sample can be input into the text error correction network to be trained, and the output result of the network can then be compared with the correct sentence sample corresponding to the wrong sentence sample (hereinafter also referred to as the training label). If the comparison indicates that the output result is inconsistent with the training label, the network parameter values of the text error correction network can be adjusted, and the next round of training can be performed on the adjusted network. Training continues until the comparison indicates that the output result closely matches the training label, which yields the trained text error correction network.
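  • The round-based supervision described above can be sketched as follows. This is only an illustration of the loop structure (feed in the wrong sentence, compare the output with the training label, adjust parameters, repeat). The `ToyCorrector` class, its substitution-table "parameters", and the sample sentences are all assumptions made for this sketch; a real text error correction network would be a neural model whose weights are updated by an optimizer.

```python
from typing import Dict, List, Tuple


class ToyCorrector:
    """Toy stand-in for the text error correction network (an assumption).

    Its "parameters" are a character substitution table, not neural weights.
    """

    def __init__(self) -> None:
        self.params: Dict[str, str] = {}

    def correct(self, sentence: str) -> str:
        # Apply every learned substitution character by character.
        return "".join(self.params.get(ch, ch) for ch in sentence)

    def adjust(self, wrong: str, output: str, label: str) -> None:
        # Update parameters at each position where the output disagrees with the label.
        for src, out, target in zip(wrong, output, label):
            if out != target:
                self.params[src] = target


def train(model: ToyCorrector, samples: List[Tuple[str, str]], max_rounds: int = 10) -> int:
    """Train on (wrong sentence, correct sentence) pairs; return the rounds used."""
    for round_no in range(1, max_rounds + 1):
        mismatches = 0
        for wrong, label in samples:
            output = model.correct(wrong)
            if output != label:  # compare the output with the training label
                mismatches += 1
                model.adjust(wrong, output, label)
        if mismatches == 0:  # every output matches its label: stop training
            return round_no
    return max_rounds


model = ToyCorrector()
rounds_used = train(model, [("我再家", "我在家"), ("自已", "自己")])
print(rounds_used, model.correct("我再家"))
```

  • Because the toy model ignores context, it would also rewrite a correctly used 再; learning when a character is wrong in context is precisely what a trained network is expected to add.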
  • The above wrong sentence samples can be obtained by performing text destruction on the correct sentence samples. Specifically, this can be realized through the following steps: obtaining a preset candidate character table, which includes a plurality of candidate characters and, for each candidate character, the corresponding phonetically similar characters and font-similar characters; and performing text destruction on the correct sentence sample based on the candidate character table to obtain the wrong sentence sample.
  • the list of candidate characters may be preset.
  • the candidate character table as prior knowledge, may store candidate characters and phonetically similar characters and font-like characters corresponding to the candidate characters. In this way, for the correct sentence samples to be destroyed, the corresponding wrong sentence samples can be determined based on the above-mentioned prior knowledge.
  • the above candidate character table may be stored using candidate characters as a basic storage unit.
  • Candidate characters can be obtained from existing character sets, or they can be high-frequency characters, common characters, error-prone and confusing characters collected based on different application scenarios.
  • For example, the candidate characters may include characters such as "out", "send", and "place"; for the driving field, the candidate characters may be characters such as "car" and "vehicle". Further examples are not listed here.
  • For each candidate character in the candidate character table, it is possible to determine which characters are easily confused with it in shape (the font-similar characters) and which characters have the same or similar pronunciation (the phonetically similar characters), and then to establish the correspondence between these phonetically similar and font-similar characters and the candidate character. When the text of a correct sentence sample needs to be destroyed, the destruction can be based on this correspondence.
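  • One possible in-memory layout for such a candidate character table is sketched below, keyed by candidate character as the basic storage unit. The class and function names, the choice to treat confusable pairs as symmetric, and the example character pairs are assumptions for illustration; the source does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class CandidateEntry:
    """Confusable characters recorded for one candidate character."""
    phonetic_similar: List[str] = field(default_factory=list)
    font_similar: List[str] = field(default_factory=list)


def build_candidate_table(
    phonetic_pairs: Iterable[Tuple[str, str]],
    font_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, CandidateEntry]:
    """Build a table keyed by candidate character.

    Assumption: each pair (a, b) is symmetric, so b is recorded for a and a for b.
    """
    table: Dict[str, CandidateEntry] = {}
    for pairs, attr in ((phonetic_pairs, "phonetic_similar"), (font_pairs, "font_similar")):
        for a, b in pairs:
            for src, dst in ((a, b), (b, a)):
                bucket = getattr(table.setdefault(src, CandidateEntry()), attr)
                if dst not in bucket:
                    bucket.append(dst)
    return table


# Illustrative pairs only; a real table would be curated per application scenario.
table = build_candidate_table(
    phonetic_pairs=[("在", "再"), ("的", "得")],
    font_pairs=[("己", "已"), ("未", "末")],
)
print(table["在"])
```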
  • the correct sentence sample can be a short sentence with context, usually containing multiple characters.
  • segmentation may be performed first, and then replacement may be performed, so as to realize targeted text destruction operations.
  • The text destruction can be realized through the following steps: the correct sentence sample is segmented to obtain multiple word segments; for a first word segment among the multiple word segments, the candidate character matching the first word segment is looked up in the candidate character table, and the first word segment is replaced with a phonetically similar character or font-similar character of the found candidate character to obtain a replacement result; based on the replacement result, the wrong sentence sample corresponding to the correct sentence sample is determined.
  • The first word segment may be one or more arbitrarily chosen word segments, or one or more specified word segments.
  • In the former case, the replacement result can be the replacement result for each of the chosen word segments, so that the resulting wrong sentence samples are more diverse, which benefits the subsequent error correction learning of the text error correction network.
  • In the latter case, the first word segment can be selected based on which characters are confusing and error-prone, so that the generated wrong sentence samples are more targeted and better adapted to specific scenarios.
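  • The replacement step above might look like the sketch below. It assumes the sentence has already been segmented, that a segment "matches" a candidate character when it contains that character, and that only one character of the chosen segment is replaced; these matching rules, the `CANDIDATES` table, and the function names are hypothetical and are not taken from the source.

```python
import random
from typing import Dict, List, Optional

# Hypothetical candidate table: candidate character -> confusable characters per dimension.
CANDIDATES: Dict[str, Dict[str, List[str]]] = {
    "在": {"phonetic": ["再"], "font": []},
    "己": {"phonetic": [], "font": ["已"]},
}


def corrupt_segment(segment: str, rng: random.Random) -> Optional[str]:
    """Replace the first character of the segment that has a confusable counterpart."""
    for i, ch in enumerate(segment):
        entry = CANDIDATES.get(ch)
        if entry is None:
            continue
        options = entry["phonetic"] + entry["font"]
        if options:
            return segment[:i] + rng.choice(options) + segment[i + 1:]
    return None  # no candidate character matched this segment


def make_wrong_sample(segments: List[str], target: int, rng: random.Random) -> Optional[str]:
    """Destroy segments[target] and join the segments back into a wrong sentence."""
    replaced = corrupt_segment(segments[target], rng)
    if replaced is None:
        return None
    # Segments are joined without separators, as in unspaced Chinese text.
    return "".join(segments[:target] + [replaced] + segments[target + 1:])


rng = random.Random(0)
print(make_wrong_sample(["我", "在", "家"], target=1, rng=rng))
```

  • Returning `None` when no candidate matches lets the caller skip that segment and try another one, so every emitted wrong sentence differs from its correct sentence by at least one character.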
  • the segmentation processing in the embodiments of the present disclosure may be implemented based on methods such as dictionaries, statistics, and neural networks.
  • Dictionary-based segmentation can be implemented using greedy matching. In practical applications, the dictionary can be searched starting from the first character of the sentence to find the longest dictionary word that starts with that character, which yields the first word segment; the same lookup is then repeated from the next unsegmented character.
  • Segmentation processing can also be realized based on a Long Short-Term Memory (LSTM) recurrent neural network; the specific method is not described here.
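  • Greedy dictionary matching of this kind is commonly implemented as forward maximum matching. The sketch below is a minimal version under stated assumptions: the dictionary is a plain set of words, characters not covered by any dictionary word become single-character segments, and the function name and sample dictionary are illustrative.

```python
from typing import List, Optional, Set


def forward_max_match(sentence: str, dictionary: Set[str], max_len: Optional[int] = None) -> List[str]:
    """Segment a sentence by repeatedly taking the longest dictionary word at the cursor."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    segments: List[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink it until a dictionary word is found.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                # A single character is always accepted as a fallback segment.
                segments.append(piece)
                i += size
                break
    return segments


words = {"文本", "纠错", "文本纠错", "方法"}
print(forward_max_match("文本纠错的方法", words))
```

  • With this dictionary the longer entry 文本纠错 wins over 文本, which is the greedy behaviour described above.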
  • Text destruction can also be performed on multiple first word segments. If several consecutive word segments are destroyed, subsequent error correction becomes harder and its accuracy may drop, so the multiple first word segments can be chosen with intervals between them.
  • The intervals between the multiple first word segments can be the same or different, which is not limited here. In this way, errors do not occur in several consecutive word segments, which satisfies the diversity of wrong sentence samples while preserving the accuracy of subsequent error correction.
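  • A simple way to choose several first word segments with intervals between them is sketched below. The `min_gap` parameter (the minimum number of untouched segments between two chosen ones), the shuffle-then-filter strategy, and the function name are assumptions for illustration; the source only requires that consecutive segments are not all destroyed.

```python
import random
from typing import List


def pick_spaced_indices(n_segments: int, max_targets: int, min_gap: int, rng: random.Random) -> List[int]:
    """Pick up to max_targets segment indices with at least min_gap segments between any two."""
    order = list(range(n_segments))
    rng.shuffle(order)  # random order gives varied, not necessarily equal, intervals
    chosen: List[int] = []
    for idx in order:
        if len(chosen) >= max_targets:
            break
        # Accept the index only if it is far enough from every index already chosen.
        if all(abs(idx - c) > min_gap for c in chosen):
            chosen.append(idx)
    return sorted(chosen)


rng = random.Random(42)
print(pick_spaced_indices(n_segments=10, max_targets=3, min_gap=1, rng=rng))
```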
  • the text error correction method provided by the embodiments of the present disclosure can be applied to the review of content to be published. For example, it may be an editorial manager who conducts management review on articles edited by editorial workers.
  • the text error correction method can also be applied to checking the correctness of related content. For example, it may be to proofread and review the text in the article when the author uploads his own article.
  • the text error correction method can also be applied to various links that require text error correction, and no specific limitation is set here.
  • error correction prompt information may be returned to the client to indicate the position to be corrected corresponding to the text content to be corrected.
  • the user can confirm that a text error has occurred based on the position to be corrected.
  • the selection of candidate text content after error correction can also be realized by displaying the reference text content according to the trigger command for the position to be corrected. In this way, the user can actively modify according to the reference text content.
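  • One way to derive such prompt information is to compare the submitted text with the error-corrected text position by position, as sketched below. The `CorrectionPrompt` structure and `build_prompts` function are hypothetical, and the sketch assumes the correction is character-for-character, so both strings have the same length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrectionPrompt:
    position: int   # index of the position to be corrected in the submitted text
    original: str   # erroneous character found at that position
    reference: str  # reference character proposed by the error correction network


def build_prompts(original: str, corrected: str) -> List[CorrectionPrompt]:
    """List every position where the corrected text differs from the submitted text."""
    if len(original) != len(corrected):
        raise ValueError("this sketch assumes character-for-character correction")
    return [
        CorrectionPrompt(position=i, original=o, reference=c)
        for i, (o, c) in enumerate(zip(original, corrected))
        if o != c
    ]


for prompt in build_prompts("我再家", "我在家"):
    print(prompt)
```

  • The client could then mark each `position` and show the `reference` character when the user triggers that position.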
  • the display of the content of the reference text is various, and may be displayed in combination with special effects.
  • the reference text content may be displayed with a preset display effect at the position to be corrected. For example, it may be to add a strikethrough to the original error text (that is, the text content to be corrected), display the candidate correct text (that is, the reference text content) nearby, and display the correct text through special effects such as pop-up windows and bubble boxes. It is also possible to replace the text content at the position to be corrected with the reference text content, and display the reference text content at the position to be corrected. For example, a candidate correct text replacing an original erroneous text may be highlighted, and the like. It is also possible to display the original error text and the candidate correct text in split screens. In addition, the embodiments of the present disclosure may also adopt other presentation manners, which are not specifically limited here.
  • The text error correction method provided by the embodiments of the present disclosure can not only help users correct text errors, but also compile error statistics for a single person or a single article to support performance evaluation.
  • The obtained error correction history information may include statistical information such as the number of error corrections in a single piece of verification content, the total number of error corrections across multiple pieces of verification content, the text content to be corrected that corresponds to the same error, and the article type of the verification content to which the text content to be corrected corresponding to the same error belongs. For example, the average number of error corrections can be computed over a time period to facilitate the quantitative assessment of relevant personnel.
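  • The statistics listed above could be aggregated roughly as follows. The `PromptRecord` fields, the output keys, and the assumption that the records have already been filtered to the preset time period are all illustrative choices, not details given in the source.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PromptRecord:
    article_id: str    # which piece of verification content the prompt belongs to
    article_type: str  # article type of that verification content
    wrong_text: str    # erroneous text content that was flagged


def summarize_history(records: List[PromptRecord]) -> Dict[str, object]:
    """Aggregate error correction prompts (assumed pre-filtered to the time period)."""
    per_article = Counter(r.article_id for r in records)
    per_error = Counter(r.wrong_text for r in records)
    repeated = {text: n for text, n in per_error.items() if n > 1}
    types_of_repeated = {
        text: sorted({r.article_type for r in records if r.wrong_text == text})
        for text in repeated
    }
    return {
        "corrections_per_article": dict(per_article),
        "total_corrections": len(records),
        "repeated_errors": repeated,
        "article_types_of_repeated_errors": types_of_repeated,
    }


history = [
    PromptRecord("a1", "news", "再"),
    PromptRecord("a1", "news", "已"),
    PromptRecord("a2", "novel", "再"),
]
print(summarize_history(history))
```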
  • The order in which the steps are written does not imply a strict execution order and does not constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible inner logic.
  • The embodiment of the present disclosure also provides a text error correction device corresponding to the text error correction method. Because the problem-solving principle of the device is similar to that of the above text error correction method, the implementation of the device can refer to the implementation of the method, and repeated descriptions are omitted.
  • FIG. 2 is a schematic diagram of a text error correction device provided by an embodiment of the present disclosure. The device includes: an acquisition module 201, used to acquire text content to be corrected; and an error correction module 202, used to perform, based on the trained text error correction network, multi-dimensional text error correction including the phonetic dimension and the font dimension on the text content to be corrected, and obtain the error-corrected text content.
  • The text error correction network is trained on wrong sentence samples, and the wrong sentence samples are obtained by destroying correct sentence samples based on preset phonetically similar characters and font-similar characters.
  • With the above device, multi-dimensional text error correction can be performed on the text content to be corrected based on the trained text error correction network to obtain the corrected text content. Because the text error correction network is trained on wrong sentence samples, and the wrong sentence samples are obtained by destroying correct sentence samples based on preset phonetically similar characters and font-similar characters, the network can learn the conversion relationship between wrong sentences and correct sentences. This relationship guides rapid error correction of the text content to be corrected, so error correction efficiency is high.
  • the above-mentioned device includes a training module 203; the training module 203 is used to train the text error correction network according to the following steps: obtaining correct sentence samples and performing text destruction on the correct sentence samples to obtain wrong sentence samples; There is at least one different character between the sentence sample and the correct sentence sample; by using the wrong sentence sample as the input data of the text error correction network to be trained, and using the correct sentence sample corresponding to the wrong sentence sample as the label of the wrong sentence sample, the The text error correction network performs at least one round of training to obtain a trained text error correction network.
  • The training module 203 is used to obtain the wrong sentence sample according to the following steps: obtaining a preset candidate character table, where the candidate character table includes a plurality of candidate characters and, for each candidate character, the corresponding phonetically similar characters and font-similar characters; and performing text destruction on the correct sentence sample based on the candidate character table to obtain the wrong sentence sample.
  • The training module 203 is used to perform text destruction on the correct sentence sample based on the candidate character table according to the following steps to obtain the wrong sentence sample: segmenting the correct sentence sample to obtain multiple word segments; for a first word segment among the multiple word segments, looking up the candidate character matching the first word segment in the candidate character table, and replacing the first word segment with a phonetically similar character or font-similar character of the found candidate character to obtain a replacement result; and determining, based on the replacement result, the wrong sentence sample corresponding to the correct sentence sample.
  • the obtaining module 201 is configured to obtain the text content to be corrected according to the following steps: receiving the verification content to be corrected uploaded by the client, the type of verification content includes at least one of text and image , the verification content to be corrected includes the text content to be corrected.
  • When the verification content includes text, the text content to be corrected includes characters or character strings in the text; and/or, when the verification content includes an image, the text content to be corrected includes characters or character strings in the text recognized from the image by means of character recognition.
  • the above-mentioned device further includes: a prompt module 204, configured to return error correction prompt information to the client after obtaining the error-corrected text content; the error correction prompt information is used to indicate the text to be corrected The position to be corrected corresponding to the content.
  • The error correction prompt information is also used to provide reference text content corresponding to the erroneous text content in the text content to be corrected. The above device further includes a display module 205, configured to display the reference text content in response to a trigger command for the position to be corrected.
  • The display module 205 is configured to display the reference text content according to the following steps: displaying the reference text content with a preset display effect at the position to be corrected; or replacing the text content at the position to be corrected with the reference text content, and displaying the reference text content at the position to be corrected; or displaying the text content to be corrected and the reference text content in split screens.
  • the text includes multiple articles; the above-mentioned device further includes: an assessment module 206, configured to generate an error correction prompt for each of the multiple articles returned to the client based on a preset time period Information, to determine the error correction history information of the client; the error correction history information includes the number of error corrections in a single verification content, the total number of error corrections in multiple verification contents, the text content to be corrected corresponding to the same error, and the corresponding error At least one of the article types of the verification content that the text content to be corrected belongs to; and determine the performance appraisal result for the client according to the error correction history information.
  • the embodiment of the present disclosure also provides an electronic device, as shown in FIG. 3 , which is a schematic structural diagram of the electronic device provided by the embodiment of the present disclosure, including: a processor 301 , a memory 302 , and a bus 303 .
  • the memory 302 stores machine-readable instructions executable by the processor 301 (for example, execution instructions corresponding to the obtaining module 201 and the error correction module 202 in the apparatus in FIG. 2); when the electronic device runs, the processor 301 and the memory 302 communicate through the bus 303.
  • when the machine-readable instructions are executed by the processor 301, the following processing is performed: obtaining the text content to be corrected; and performing, based on the trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content; the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.
  • An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the text error correction method described in the above method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the text error correction method described in the above method embodiments. For details, refer to the foregoing method embodiments, which are not repeated here.
  • the above-mentioned computer program product may be implemented by means of hardware, software, or a combination thereof.
  • in one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
  • Embodiments of the present disclosure also provide a computer program, which, when executed by a processor, executes the steps of the text error correction method described in the above-mentioned method embodiments.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

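A text error correction method and apparatus, an electronic device, and a storage medium. According to one example of the method, after text content to be corrected is obtained (S101), multi-dimensional text error correction covering a phonetic dimension and a glyph dimension is performed on the text content to be corrected based on a trained text error correction network, to obtain corrected text content; the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters (S102).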

Description

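Text Error Correction Method and Apparatus, Electronic Device, and Storage Medium
Cross-Reference to Related Applications
The present disclosure claims priority to Chinese patent application No. 202110711749.7, filed on June 25, 2021 and entitled "A text error correction method and apparatus, electronic device, and storage medium", the entire disclosure of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of information processing technology, and in particular to a text error correction method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of technology, text recognition, and in particular optical character recognition (OCR), has been applied ever more widely. OCR can recognize text content from an image. However, owing to factors such as handwriting style and the external environment, the recognized text content may contain wrong characters.
Text error correction is the process of correcting wrong characters in text. In the related art, correction may be performed manually, which consumes a great deal of the relevant personnel's time and is inefficient.
Summary
Embodiments of the present disclosure provide at least a text error correction method and apparatus, an electronic device, and a storage medium, so as to improve error correction efficiency.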
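In a first aspect, an embodiment of the present disclosure provides a text error correction method, the method including: obtaining text content to be corrected; and performing, based on a trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content; wherein the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.
With the above method, once the text content to be corrected has been obtained, multi-dimensional text error correction can be performed on it based on the trained text error correction network to obtain the corrected text content. Because the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar and glyph-similar characters, the network can learn the conversion relationship between erroneous sentences and correct sentences. This in turn guides fast correction of the text content to be corrected, with high correction efficiency.
In a possible implementation, the text error correction network is trained according to the following steps: obtaining a correct sentence sample and an erroneous sentence sample obtained by performing text corruption on the correct sentence sample, where the erroneous sentence sample differs from the correct sentence sample in at least one character; and performing at least one round of training on the text error correction network by using the erroneous sentence sample as input data of the network to be trained and using the corresponding correct sentence sample as the label of the erroneous sentence sample.
The network is trained through the comparison between the label and the output. Training continues until the output points to the correct sentence sample, which indicates that the training goal has been reached, that is, the conversion relationship between erroneous and correct sentences has been learned. The resulting text error correction network also has relatively high correction accuracy.
In a possible implementation, the erroneous sentence sample is obtained according to the following steps: obtaining a preset candidate character table, the table including multiple candidate characters and, for each candidate character, the corresponding phonetically similar characters and glyph-similar characters; and performing text corruption on the correct sentence sample based on the candidate character table to obtain the erroneous sentence sample.
Text corruption of the correct sentence sample can be carried out based on the phonetically similar and glyph-similar characters corresponding to the candidate characters, so that in the correction stage errors can be corrected along multiple dimensions, such as the phonetic and glyph dimensions, which improves correction accuracy.
In a possible implementation, performing text corruption on the correct sentence sample based on the candidate character table to obtain the erroneous sentence sample includes: segmenting the correct sentence sample to obtain multiple word segments; for a first word segment among the multiple word segments, looking up in the candidate character table a candidate character that matches the first word segment, and replacing the first word segment using a phonetically similar or glyph-similar character corresponding to the found candidate character, to obtain a replacement result; and determining, based on the replacement result, the erroneous sentence sample corresponding to the correct sentence sample.
Text corruption can thus be carried out through replacement with phonetically similar or glyph-similar characters on top of segmentation, so that the resulting erroneous sentence sample remains similar to the correct sentence sample to a certain degree; a text error correction network trained on this similarity can correct errors well.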
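In a possible implementation, obtaining the text content to be corrected includes: receiving verification content to be corrected that is uploaded by a client, where the type of the verification content includes at least one of text and image, and the verification content to be corrected includes the text content to be corrected.
In a possible implementation, where the verification content includes text, the text content to be corrected includes characters or character strings in the text; and/or, where the verification content includes an image, the text content to be corrected includes characters or character strings in text recognized from the image by text recognition.
In a possible implementation, after the corrected text content is obtained, the method further includes: returning error correction prompt information to the client, the error correction prompt information indicating the position to be corrected that corresponds to the text content to be corrected.
Once the corrected text content has been determined, the position to be corrected can be indicated to the user by presenting the error correction prompt information, so that the user knows exactly where the error is and can fix it promptly.
In a possible implementation, the error correction prompt information is further used to provide reference text content corresponding to the erroneous text content in the text content to be corrected, and the method further includes: displaying the reference text content in response to a trigger instruction for the position to be corrected.
The relevant reference text content can be displayed, and the user can select the desired reference text content from what is displayed without typing it manually, which saves time and effort.
In a possible implementation, displaying the reference text content includes: displaying the reference text content at the position to be corrected with a preset display effect; or replacing the text content at the position to be corrected with the reference text content and displaying the reference text content at that position; or displaying the text content to be corrected and the reference text content in split-screen mode.
In a possible implementation, the text includes multiple articles, and the method further includes: determining error correction history information of the client based on the error correction prompt information generated, within a preset time period, for each of the multiple articles returned to the client, where the error correction history information includes at least one of: the number of corrections in a single piece of verification content, the total number of corrections across multiple pieces of verification content, the text content to be corrected that corresponds to the same error, and the article type of the verification content to which the text content to be corrected corresponding to the same error belongs; and determining a performance appraisal result for the client according to the error correction history information.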
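In a second aspect, an embodiment of the present disclosure further provides a text error correction apparatus, the apparatus including: an obtaining module configured to obtain text content to be corrected; and an error correction module configured to perform, based on a trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content; wherein the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the text error correction method according to the first aspect or any of its implementations.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program performs the steps of the text error correction method according to the first aspect or any of its implementations.
In a fifth aspect, an embodiment of the present disclosure further provides a computer program which, when executed by a processor, performs the steps of the text error correction method according to the first aspect or any of its implementations.
For the effects of the above text error correction apparatus, electronic device, computer-readable storage medium, and computer program, refer to the description of the text error correction method below; details are not repeated here.
To make the above objects, features, and advantages of the present disclosure clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.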
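Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the embodiments are briefly introduced below. These drawings show embodiments consistent with the present disclosure and, together with the specification, serve to explain its technical solutions. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.
FIG. 1 shows a flowchart of a text error correction method provided by an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a text error correction apparatus provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.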
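Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments generally described and shown in the drawings herein may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the present disclosure, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Note that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.
The term "and/or" herein merely describes an association relationship and indicates that three relationships are possible; for example, "A and/or B" may mean A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
Research has found that text error correction is the process of correcting wrong characters in text. In the related art, correction may be performed manually, which consumes a great deal of the relevant personnel's time and is inefficient.
Based on the above research, the present disclosure provides a text error correction method and apparatus, an electronic device, and a storage medium, so as to improve error correction efficiency.
To facilitate understanding of this embodiment, a text error correction method disclosed in an embodiment of the present disclosure is first described in detail. The method is generally executed by a computer device with a certain computing capability, for example a terminal device, a server, or another processing device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the text error correction method may be implemented by a processor calling computer-readable instructions stored in a memory.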
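Referring to FIG. 1, which is a flowchart of the text error correction method provided by an embodiment of the present disclosure, the method includes steps S101 to S102:
S101: obtain text content to be corrected;
S102: perform, based on a trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content; the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.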
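To facilitate understanding of the text error correction method provided by the embodiments of the present disclosure, its application scenarios are briefly introduced first. The method can be applied in any field that requires text error correction, such as speech recognition, optical character recognition (OCR), new media, and question answering. Given how widely OCR is used, most of the examples below concern text error correction in OCR.
Manual correction in the related art cannot guarantee correction efficiency, which largely limits the wide application of text error correction in various technical fields. To improve correction efficiency, the embodiments of the present disclosure provide a solution that performs multi-dimensional text error correction with a trained text error correction network. Because the text error correction network can be trained in advance, correction efficiency is significantly improved. At the same time, the network can correct errors along multiple dimensions, including the phonetic dimension and the glyph dimension, which further improves correction accuracy.
The text content to be corrected that is obtained in the embodiments of the present disclosure may differ with the application scenario. For example, in plain-text recognition, the text content to be corrected may come from an ordinary article or from a literary work, such as a paragraph of a novel, and may be presented as characters or character strings. As another example, in OCR, the text content to be corrected may be editable text content obtained by intelligently recognizing the text in an image, and may likewise be presented as characters or character strings.
In a specific application, the text content to be corrected may be obtained from verification content to be corrected that is uploaded by a client. The verification content to be corrected may include text and may also include images. The content to be corrected that corresponds to text may be the content of the text, for example an article or a novel; the content to be corrected that corresponds to an image may be the text content recognized from the image by text recognition, for example text content determined by text detection and text recognition from an image carrying textual information, such as a poster.
Once the text content to be corrected has been obtained, multi-dimensional text error correction can be performed based on the trained text error correction network, and the corrected text content can be obtained directly.
The purpose of training the text error correction network in the embodiments of the present disclosure may be to learn the conversion relationship between erroneous sentence samples and correct sentence samples. The erroneous sentence samples may be obtained by corrupting correct sentence samples based on phonetically similar characters and glyph-similar characters. Once this conversion relationship has been learned, the text error correction network can be used to correct the text content to be corrected.
It should be noted that, because text corruption covers both corruption based on phonetically similar characters and corruption based on glyph-similar characters, correction of the text content to be corrected can be achieved in at least two dimensions, the phonetic dimension and the glyph dimension. That is, whether the text content to be corrected contains a phonetic error or a glyph error, the embodiments of the present disclosure can correct it efficiently and accurately.
Take OCR again as an example. If the text content to be corrected is "眼睛蛇" (intended: "眼镜蛇", cobra), a phonetic error is present, and it can be corrected using the phonetic similarity between "睛" and "镜". Likewise, if the text content to be corrected is "缺铁性盆血" (intended: "缺铁性贫血", iron-deficiency anemia), a glyph error is present, and it can be corrected using the glyph similarity between "盆" and "贫".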
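The training process of the text error correction network is described in detail next. It mainly includes the following steps: obtaining a correct sentence sample and an erroneous sentence sample obtained by performing text corruption on the correct sentence sample, where the erroneous sentence sample differs from the correct sentence sample in at least one character; and performing at least one round of training on the text error correction network by using the erroneous sentence sample as input data of the network to be trained and the corresponding correct sentence sample as the label of the erroneous sentence sample, to obtain the trained text error correction network.
With correct sentence samples and erroneous sentence samples obtained in advance, an erroneous sentence sample and its corresponding correct sentence sample can serve, respectively, as the input and as the output comparison target of the network to be trained. As the output comparison target, the correct sentence sample acts as supervision information for the corresponding erroneous sentence sample and supervises network training.
During training, an erroneous sentence sample can be fed into the text error correction network to be trained, and the network output is then compared with the correct sentence sample corresponding to that erroneous sentence sample (hereinafter also called the training label). If the comparison shows that the output is inconsistent with the training label, the network parameter values can be adjusted and the next round of training can be performed with the adjusted network, until the comparison shows that the output closely matches the training label, yielding the trained text error correction network.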
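The round-by-round procedure described above (feed the erroneous sample, compare the output with the label, adjust parameters, repeat until they match) can be illustrated with a deliberately tiny stand-in. The sketch below is an assumption for illustration only: it replaces the neural network with a per-character lookup table, and the names `apply_corrector` and `train_toy_corrector` are hypothetical. It also assumes each erroneous sentence has the same length as its label, which holds when corruption is done purely by character replacement.

```python
from collections import Counter, defaultdict


def apply_corrector(mapping, sentence):
    """Run the toy 'model': map each character through the learned table."""
    return "".join(mapping.get(ch, ch) for ch in sentence)


def train_toy_corrector(pairs, max_rounds=5):
    """pairs: list of (erroneous_sentence, correct_sentence) of equal length.

    A lookup table stands in for the network parameters; a real system
    would update the weights of a neural network instead.
    """
    counts = defaultdict(Counter)  # counts[input_char][label_char]
    mapping = {}
    for _ in range(max_rounds):
        mismatches = 0
        for wrong, label in pairs:
            output = apply_corrector(mapping, wrong)
            if output != label:  # compare the output with the training label
                mismatches += 1
                for w, r in zip(wrong, label):
                    counts[w][r] += 1
        if mismatches == 0:  # outputs match the labels: stop training
            break
        # "Adjust the parameters": keep the most frequent label per character.
        mapping = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return mapping
```

A context-free table like this would also rewrite correct uses of a character, which is one reason a sentence-level network is used in the disclosure rather than a lookup of this kind.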
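The erroneous sentence sample may be obtained by performing text corruption on a correct sentence sample, specifically through the following steps: obtaining a preset candidate character table, the table including multiple candidate characters and, for each candidate character, the corresponding phonetically similar characters and glyph-similar characters; and performing text corruption on the correct sentence sample based on the candidate character table to obtain the erroneous sentence sample.
The candidate character table may be set in advance. As prior knowledge, it may store candidate characters together with the phonetically similar and glyph-similar characters corresponding to each candidate character. For a correct sentence sample to be corrupted, the corresponding erroneous sentence sample can then be determined based on this prior knowledge.
The candidate character table may be stored with the candidate character as the basic storage unit. Candidate characters may be taken from an existing character set, or may be high-frequency characters, commonly used characters, or easily mistaken and easily confused characters collected for different application scenarios. For example, for speech recognition, candidate characters may include "出", "发", and "地"; for driving, candidate characters may be "车" and "辆". Details are not repeated here.
In the embodiments of the present disclosure, for each candidate character in the candidate character table, it can be determined which characters are easily confused with it in glyph (the glyph-similar characters) and which characters have the same or a similar pronunciation (the phonetically similar characters), and a correspondence between these characters and the candidate character can then be established. When a correct sentence sample needs to be corrupted, the corruption can be based on this correspondence.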
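One way to picture the candidate character table is as a mapping keyed by candidate character. The following is a minimal sketch under assumptions: the class name `CandidateEntry`, the helper `confusable_chars`, and the in-memory dict layout are hypothetical, and the two entries simply reuse the "睛"/"镜" and "盆"/"贫" pairs from the OCR example in this description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CandidateEntry:
    """Confusable characters recorded for one candidate character."""
    phonetic: List[str] = field(default_factory=list)  # same or similar sound
    glyph: List[str] = field(default_factory=list)     # similar shape


# Illustrative entries only; a real table would be prepared in advance
# from a character set or from scenario-specific confusable characters.
CANDIDATE_TABLE: Dict[str, CandidateEntry] = {
    "镜": CandidateEntry(phonetic=["睛"]),
    "贫": CandidateEntry(glyph=["盆"]),
}


def confusable_chars(ch: str, table: Dict[str, CandidateEntry] = CANDIDATE_TABLE) -> List[str]:
    """Return all characters that may replace `ch` during text corruption."""
    entry = table.get(ch)
    if entry is None:
        return []
    return entry.phonetic + entry.glyph
```

Keeping the phonetic and glyph lists separate makes it possible to control how many samples of each error type are generated.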
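To support correction of the text content to be corrected, a correct sentence sample may be a short sentence with context, usually containing multiple characters. In the embodiments of the present disclosure, when text corruption is performed based on the candidate characters in the candidate character table, the sentence may first be segmented and then replaced, so that the corruption is targeted.
Text corruption may be implemented through the following steps: segmenting the correct sentence sample to obtain multiple word segments; for a first word segment among the multiple word segments, looking up in the candidate character table a candidate character that matches the first word segment, and replacing the first word segment using a phonetically similar or glyph-similar character corresponding to the found candidate character, to obtain a replacement result; and determining, based on the replacement result, the erroneous sentence sample corresponding to the correct sentence sample.
The first word segment may be one or more arbitrarily chosen word segments, or one or more specific word segments. The replacement result may then be the replacement result for each of the one or more word segments, which makes the resulting erroneous sentence samples more diverse and helps the text error correction network learn to correct errors.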
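The corruption steps above can be sketched as follows. This is an illustration under assumptions, not the disclosed implementation: the function names, the dict-based table layout, and the choice to replace a single character per selected segment are hypothetical, and the segmentation itself is taken as given (the caller passes already-segmented text and the indices of the first word segments).

```python
import random

# Hypothetical table: candidate character -> confusable characters.
CANDIDATE_TABLE = {
    "镜": {"phonetic": ["睛"], "glyph": []},
    "贫": {"phonetic": [], "glyph": ["盆"]},
}


def corrupt_segment(segment, table, rng):
    """Replace one character of the segment that has an entry in the table."""
    positions = [
        i for i, ch in enumerate(segment)
        if ch in table and (table[ch]["phonetic"] or table[ch]["glyph"])
    ]
    if not positions:
        return segment  # nothing in this segment can be corrupted
    i = rng.choice(positions)
    options = table[segment[i]]["phonetic"] + table[segment[i]]["glyph"]
    return segment[:i] + rng.choice(options) + segment[i + 1:]


def corrupt_sentence(segments, first_indices, table=CANDIDATE_TABLE, rng=None):
    """Corrupt the segments at `first_indices` and join the replacement result."""
    rng = rng if rng is not None else random.Random()
    corrupted = list(segments)
    for idx in first_indices:
        corrupted[idx] = corrupt_segment(corrupted[idx], table, rng)
    return "".join(corrupted)
```

For example, `corrupt_sentence(["缺铁性", "贫血"], [1])` yields the erroneous sample "缺铁性盆血", which can be paired with the original sentence as its label.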
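It should be noted that the first word segment may be selected according to which characters are easily confused or easily mistaken, so that the generated erroneous sentence samples are more targeted and better suited to specific scenarios.
Segmentation in the embodiments of the present disclosure may be implemented by dictionary-based, statistical, or neural-network-based methods. For example, dictionary-based segmentation may use greedy matching: starting from the first character of the sentence, the dictionary is searched for the longest word that begins with that character, which gives the first segmented word. As another example, statistics-based segmentation may take a global view: the most reasonable combination is sought among all possible segmentations, which is equivalent to finding the path with the highest probability in the segmentation graph. As a further example, segmentation may be based on a long short-term memory (LSTM) network, a type of recurrent neural network; the specific method is not described here.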
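The dictionary-based greedy matching described above is commonly known as forward maximum matching. A minimal sketch follows; the function name, the fallback of emitting a single character when no dictionary word matches, and the sample dictionary are assumptions made for illustration.

```python
def greedy_segment(sentence, dictionary, max_len=None):
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character
    segments (an assumption; the description does not specify a fallback).
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    segments = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink until a word matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                segments.append(piece)
                i += size
                break
    return segments
```

With the dictionary `{"缺铁性", "缺铁", "贫血", "常见"}`, the sentence "缺铁性贫血很常见" is split into "缺铁性", "贫血", "很", and "常见": the longer "缺铁性" wins over "缺铁", and "很" falls back to a single character.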
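Whichever of the above segmentation methods is used, during text corruption a first word segment made up of multiple characters is usually not corrupted as a whole; instead, one of its characters may be selected for corruption, which better matches correction needs.
In addition, to better match correction needs, text corruption may be applied to multiple first word segments. Corrupting several consecutive word segments would make subsequent correction harder and might reduce correction accuracy, so the multiple first word segments may be spaced apart. The intervals between them may be equal or different, which is not limited here. This avoids errors in several consecutive word segments, which both provides diversity in the erroneous samples and preserves the accuracy of subsequent correction.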
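Choosing several first word segments that are never adjacent can be sketched as below. The function name, the `min_gap` parameter, and the shuffle-then-filter strategy are assumptions for illustration; the description only requires that the selected segments be spaced apart, with equal or unequal intervals.

```python
import random


def pick_spaced_indices(num_segments, count, min_gap=1, rng=None):
    """Pick up to `count` segment indices with at least `min_gap` untouched
    segments between any two picks, so consecutive segments are never both
    corrupted."""
    rng = rng if rng is not None else random.Random()
    order = list(range(num_segments))
    rng.shuffle(order)  # random order gives varied, unequal intervals
    chosen = []
    for idx in order:
        if len(chosen) == count:
            break
        if all(abs(idx - other) > min_gap for other in chosen):
            chosen.append(idx)
    return sorted(chosen)
```

The returned indices can be used as the positions of the first word segments to corrupt; fewer than `count` indices are returned when the sentence is too short to keep the required spacing.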
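The text error correction method provided by the embodiments of the present disclosure can be applied to reviewing content before publication, for example when an editorial manager reviews articles that editors have finished editing. The method can also be applied to checking the correctness of relevant content, for example proofreading the text of an article while the author uploads it. Beyond these, the method can be applied in any step that requires text error correction, which is not specifically limited here.
After the corrected text content is obtained, error correction prompt information may be returned to the client to indicate the position to be corrected that corresponds to the text content to be corrected. Based on the position to be corrected, the user can confirm that a text error has occurred. In addition, candidate corrected text content can be selected by displaying reference text content in response to a trigger instruction for the position to be corrected, so that the user can actively revise the text according to the reference text content.
The reference text content can be displayed in many ways and may be combined with display effects.
Specifically, the reference text content may be displayed at the position to be corrected with a preset display effect. For example, a strikethrough may be added to the original erroneous text (that is, the text content to be corrected) with the candidate correct text (that is, the reference text content) shown nearby, or the correct text may be shown through effects such as a pop-up window or a bubble box. Alternatively, the text content at the position to be corrected may be replaced with the reference text content, and the reference text content displayed at that position; for example, the candidate correct text that replaced the original erroneous text may be highlighted. Alternatively, the original erroneous text and the candidate correct text may be displayed in split-screen mode. The embodiments of the present disclosure may also use other display methods, which are not specifically limited here.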
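The error correction prompt information, that is, the positions to be corrected together with the reference text content, could be derived by comparing the text before and after correction. The sketch below is an assumption: the function name and the dict fields `start`, `end`, `wrong`, and `reference` are hypothetical, and it assumes the corrected text has the same length as the original, which holds when correction only substitutes characters.

```python
def build_prompt_info(original, corrected):
    """Return one hint per run of differing characters.

    Each hint records the half-open span [start, end) of the position to be
    corrected, the erroneous text, and the reference text.
    """
    if len(original) != len(corrected):
        raise ValueError("this sketch only handles substitution-only correction")
    hints = []
    start = None
    for i, (a, b) in enumerate(zip(original, corrected)):
        if a != b and start is None:
            start = i  # a run of differing characters begins
        elif a == b and start is not None:
            hints.append({"start": start, "end": i,
                          "wrong": original[start:i],
                          "reference": corrected[start:i]})
            start = None
    if start is not None:  # the run extends to the end of the text
        hints.append({"start": start, "end": len(original),
                      "wrong": original[start:],
                      "reference": corrected[start:]})
    return hints
```

A client could use `start` and `end` to place a strikethrough or highlight, and show `reference` in a bubble box or split-screen view when the user triggers that position.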
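While helping users correct text errors, the text error correction method provided by the embodiments of the present disclosure can also compile error statistics from the errors found in each submission or article, to support performance appraisal.
The resulting error correction history information may include the number of corrections in a single piece of verification content, the total number of corrections across multiple pieces of verification content, the text content to be corrected that corresponds to the same error, the article type of the verification content to which the text content to be corrected corresponding to the same error belongs, and other statistics. For example, the average number of corrections can be computed over a time period, so that the relevant personnel can be assessed quantitatively.
For example, for an editor, a larger number of corrections in a single edited article suggests, to some extent, carelessness; a larger number of corrections across multiple edited articles suggests, to some extent, a poor work attitude; and if the different articles in which the editor makes the same error are of the same type, this suggests, to some extent, insufficient understanding of the field those articles cover. With timely access to these statistics, more targeted management measures can be applied to editors.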
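The statistics listed above can be aggregated from the prompt information returned within a time period. The following sketch is an assumption for illustration: the record fields (`article_id`, `article_type`, `wrong`, `reference`) and the function name are hypothetical, and it stops at the raw statistics because the description does not specify how they are turned into an appraisal result.

```python
from collections import Counter, defaultdict


def summarize_history(records):
    """Aggregate correction records for one client over a time period.

    Each record is a dict with the hypothetical keys 'article_id',
    'article_type', 'wrong' and 'reference'.
    """
    records = list(records)
    per_article = Counter(r["article_id"] for r in records)
    total = sum(per_article.values())
    # For each distinct error, collect the article types it appeared in.
    types_by_error = defaultdict(set)
    for r in records:
        types_by_error[(r["wrong"], r["reference"])].add(r["article_type"])
    return {
        "per_article": dict(per_article),
        "total": total,
        "average_per_article": total / len(per_article) if per_article else 0.0,
        "types_by_error": {k: sorted(v) for k, v in types_by_error.items()},
    }
```

An error that recurs across articles of a single type shows up as one key in `types_by_error` with a one-element list, which corresponds to the "same error, same article type" case discussed above.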
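A person skilled in the art will understand that, in the above method of the specific implementations, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide a text error correction apparatus corresponding to the text error correction method. Because the apparatus solves the problem on a principle similar to that of the above method, its implementation may refer to the implementation of the method, and repeated content is not described again.
Referring to FIG. 2, which is a schematic diagram of a text error correction apparatus provided by an embodiment of the present disclosure, the apparatus includes: an obtaining module 201 configured to obtain text content to be corrected; and an error correction module 202 configured to perform, based on a trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content. The text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.
With the above apparatus, once the text content to be corrected has been obtained, multi-dimensional text error correction can be performed on it based on the trained text error correction network to obtain the corrected text content. Because the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar and glyph-similar characters, the network can learn the conversion relationship between erroneous sentences and correct sentences. This in turn guides fast correction of the text content to be corrected, with high correction efficiency.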
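In a possible implementation, the above apparatus includes a training module 203, configured to train the text error correction network according to the following steps: obtaining a correct sentence sample and an erroneous sentence sample obtained by performing text corruption on the correct sentence sample, where the erroneous sentence sample differs from the correct sentence sample in at least one character; and performing at least one round of training on the text error correction network by using the erroneous sentence sample as input data of the network to be trained and the corresponding correct sentence sample as the label of the erroneous sentence sample, to obtain the trained text error correction network.
In a possible implementation, the training module 203 is configured to obtain the erroneous sentence sample according to the following steps: obtaining a preset candidate character table, the table including multiple candidate characters and, for each candidate character, the corresponding phonetically similar characters and glyph-similar characters; and performing text corruption on the correct sentence sample based on the candidate character table to obtain the erroneous sentence sample.
In a possible implementation, the training module 203 is configured to perform text corruption on the correct sentence sample based on the candidate character table according to the following steps to obtain the erroneous sentence sample: segmenting the correct sentence sample to obtain multiple word segments; for a first word segment among the multiple word segments, looking up in the candidate character table a candidate character that matches the first word segment, and replacing the first word segment using a phonetically similar or glyph-similar character corresponding to the found candidate character, to obtain a replacement result; and determining, based on the replacement result, the erroneous sentence sample corresponding to the correct sentence sample.
In a possible implementation, the obtaining module 201 is configured to obtain the text content to be corrected according to the following steps: receiving verification content to be corrected that is uploaded by a client, where the type of the verification content includes at least one of text and image, and the verification content to be corrected includes the text content to be corrected.
In a possible implementation, where the verification content includes text, the text content to be corrected includes characters or character strings in the text; and/or, where the verification content includes an image, the text content to be corrected includes characters or character strings in text recognized from the image by text recognition.
In a possible implementation, the above apparatus further includes a prompt module 204, configured to return error correction prompt information to the client after the corrected text content is obtained; the error correction prompt information indicates the position to be corrected that corresponds to the text content to be corrected.
In a possible implementation, the error correction prompt information is further used to provide reference text content corresponding to the erroneous text content in the text content to be corrected, and the above apparatus further includes a display module 205, configured to display the reference text content in response to a trigger instruction for the position to be corrected.
In a possible implementation, the display module 205 is configured to display the reference text content according to the following steps: displaying the reference text content at the position to be corrected with a preset display effect; or replacing the text content at the position to be corrected with the reference text content and displaying the reference text content at that position; or displaying the text content to be corrected and the reference text content in split-screen mode.
In a possible implementation, the text includes multiple articles, and the above apparatus further includes an assessment module 206, configured to: determine error correction history information of the client based on the error correction prompt information generated, within a preset time period, for each of the multiple articles returned to the client, where the error correction history information includes at least one of: the number of corrections in a single piece of verification content, the total number of corrections across multiple pieces of verification content, the text content to be corrected that corresponds to the same error, and the article type of the verification content to which the text content to be corrected corresponding to the same error belongs; and determine a performance appraisal result for the client according to the error correction history information.
For the processing flow of each module in the apparatus and the interaction flow between modules, refer to the relevant description in the above method embodiments; details are not repeated here.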
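An embodiment of the present disclosure further provides an electronic device. As shown in FIG. 3, which is a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure, the device includes a processor 301, a memory 302, and a bus 303. The memory 302 stores machine-readable instructions executable by the processor 301 (for example, execution instructions corresponding to the obtaining module 201 and the error correction module 202 in the apparatus in FIG. 2). When the electronic device runs, the processor 301 and the memory 302 communicate through the bus 303, and when the machine-readable instructions are executed by the processor 301, the following processing is performed: obtaining text content to be corrected; and performing, based on a trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content; the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program performs the steps of the text error correction method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code; the instructions included in the program code can be used to perform the steps of the text error correction method described in the above method embodiments. For details, refer to the above method embodiments, which are not repeated here.
The above computer program product may be implemented by hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
An embodiment of the present disclosure further provides a computer program which, when executed by a processor, performs the steps of the text error correction method described in the above method embodiments. For details, refer to the above method embodiments, which are not repeated here.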
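A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present disclosure, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage media include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
Finally, it should be noted that the above embodiments are only specific implementations of the present disclosure, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.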

Claims (13)

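  1. A text error correction method, characterized in that the method comprises:
    obtaining text content to be corrected;
    performing, based on a trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content;
    wherein the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.
  2. The method according to claim 1, characterized in that the text error correction network is trained according to the following steps:
    obtaining a correct sentence sample and an erroneous sentence sample obtained by performing text corruption on the correct sentence sample, wherein the erroneous sentence sample differs from the correct sentence sample in at least one character;
    performing at least one round of training on the text error correction network by using the erroneous sentence sample as input data of the text error correction network to be trained and using the correct sentence sample corresponding to the erroneous sentence sample as the label of the erroneous sentence sample.
  3. The method according to claim 1 or 2, characterized in that the erroneous sentence sample is obtained according to the following steps:
    obtaining a preset candidate character table, wherein the candidate character table includes multiple candidate characters and, for each candidate character, corresponding phonetically similar characters and glyph-similar characters;
    performing text corruption on the correct sentence sample based on the candidate character table to obtain the erroneous sentence sample.
  4. The method according to claim 3, characterized in that performing text corruption on the correct sentence sample based on the candidate character table to obtain the erroneous sentence sample comprises:
    segmenting the correct sentence sample to obtain multiple word segments;
    for a first word segment among the multiple word segments, looking up in the candidate character table a candidate character that matches the first word segment, and replacing the first word segment using a phonetically similar character or glyph-similar character corresponding to the found candidate character, to obtain a replacement result;
    determining, based on the replacement result, the erroneous sentence sample corresponding to the correct sentence sample.
  5. The method according to any one of claims 1 to 4, characterized in that obtaining the text content to be corrected comprises:
    receiving verification content to be corrected that is uploaded by a client,
    wherein the type of the verification content includes at least one of text and image,
    and the text content to be corrected includes characters or character strings of at least one of the text and text recognized from the image by text recognition.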
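  6. The method according to claim 5, characterized in that, after the corrected text content is obtained, the method further comprises:
    returning error correction prompt information to the client;
    wherein the error correction prompt information indicates the position to be corrected, in the verification content, that corresponds to the text content to be corrected.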
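  7. The method according to claim 6, characterized in that the error correction prompt information is further used to provide reference text content corresponding to erroneous text content in the text content to be corrected, and the method further comprises:
    displaying the reference text content in response to a trigger instruction for the position to be corrected.
  8. The method according to claim 7, characterized in that displaying the reference text content comprises any one of the following:
    displaying the reference text content at the position to be corrected with a preset display effect;
    replacing the text content at the position to be corrected with the reference text content, and displaying the reference text content at the position to be corrected;
    displaying the text content to be corrected and the reference text content in split-screen mode.
  9. The method according to any one of claims 6 to 8, characterized in that the method further comprises:
    determining error correction history information of the client based on the error correction prompt information generated, within a preset time period, for each of multiple pieces of verification content uploaded by the client, wherein the error correction history information includes at least one of: the number of corrections in a single piece of verification content, the total number of corrections across multiple pieces of verification content, the text content to be corrected that corresponds to the same error, and the article type of the verification content to which the text content to be corrected corresponding to the same error belongs;
    determining a performance appraisal result for the client according to the error correction history information.
  10. A text error correction apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain text content to be corrected;
    an error correction module, configured to perform, based on a trained text error correction network, multi-dimensional text error correction covering a phonetic dimension and a glyph dimension on the text content to be corrected, to obtain corrected text content;
    wherein the text error correction network is trained on erroneous sentence samples, and the erroneous sentence samples are obtained by corrupting correct sentence samples based on preset phonetically similar characters and glyph-similar characters.
  11. An electronic device, characterized by comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate through the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the text error correction method according to any one of claims 1 to 9.
  12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when run by a processor, performs the steps of the text error correction method according to any one of claims 1 to 9.
  13. A computer program, characterized in that the program, when executed by a processor, performs the steps of the text error correction method according to any one of claims 1 to 9.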
PCT/CN2021/134638 2021-06-25 2021-11-30 文本纠错的方法、装置、电子设备及存储介质 WO2022267353A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110711749.7A CN113343678A (zh) 2021-06-25 2021-06-25 一种文本纠错的方法、装置、电子设备及存储介质
CN202110711749.7 2021-06-25

Publications (1)

Publication Number Publication Date
WO2022267353A1 true WO2022267353A1 (zh) 2022-12-29

Family

ID=77478919

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/134638 WO2022267353A1 (zh) 2021-06-25 2021-11-30 文本纠错的方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN113343678A (zh)
WO (1) WO2022267353A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306598A (zh) * 2023-05-22 2023-06-23 上海蜜度信息技术有限公司 针对不同领域字词的定制化纠错方法、系统、设备及介质
CN116719424A (zh) * 2023-08-09 2023-09-08 腾讯科技(深圳)有限公司 一种类型识别模型的确定方法及相关装置
CN117094311A (zh) * 2023-10-19 2023-11-21 山东齐鲁壹点传媒有限公司 一种关于中文语法纠错的误纠过滤器的建立方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343678A (zh) * 2021-06-25 2021-09-03 北京市商汤科技开发有限公司 一种文本纠错的方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
CN111611791A (zh) * 2020-04-27 2020-09-01 鼎富智能科技有限公司 一种文本处理的方法及相关装置
CN112016310A (zh) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 文本纠错方法、系统、设备及可读存储介质
CN112560450A (zh) * 2020-12-11 2021-03-26 科大讯飞股份有限公司 一种文本纠错方法及装置
CN112926306A (zh) * 2021-03-08 2021-06-08 北京百度网讯科技有限公司 文本纠错方法、装置、设备以及存储介质
CN113343678A (zh) * 2021-06-25 2021-09-03 北京市商汤科技开发有限公司 一种文本纠错的方法、装置、电子设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI391832B (zh) * 2008-09-09 2013-04-01 Inst Information Industry 中文文章偵錯裝置、中文文章偵錯方法以及儲存媒體
CN109543022B (zh) * 2018-12-17 2020-10-13 北京百度网讯科技有限公司 文本纠错方法和装置
CN112396049A (zh) * 2020-11-19 2021-02-23 平安普惠企业管理有限公司 文本纠错方法、装置、计算机设备及存储介质
CN112597753A (zh) * 2020-12-22 2021-04-02 北京百度网讯科技有限公司 文本纠错处理方法、装置、电子设备和存储介质
CN112784582A (zh) * 2021-02-09 2021-05-11 中国工商银行股份有限公司 纠错方法、装置和计算设备

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306598A (zh) * 2023-05-22 2023-06-23 上海蜜度信息技术有限公司 针对不同领域字词的定制化纠错方法、系统、设备及介质
CN116306598B (zh) * 2023-05-22 2023-09-08 上海蜜度信息技术有限公司 针对不同领域字词的定制化纠错方法、系统、设备及介质
CN116719424A (zh) * 2023-08-09 2023-09-08 腾讯科技(深圳)有限公司 一种类型识别模型的确定方法及相关装置
CN116719424B (zh) * 2023-08-09 2024-03-22 腾讯科技(深圳)有限公司 一种类型识别模型的确定方法及相关装置
CN117094311A (zh) * 2023-10-19 2023-11-21 山东齐鲁壹点传媒有限公司 一种关于中文语法纠错的误纠过滤器的建立方法
CN117094311B (zh) * 2023-10-19 2024-01-26 山东齐鲁壹点传媒有限公司 一种关于中文语法纠错的误纠过滤器的建立方法

Also Published As

Publication number Publication date
CN113343678A (zh) 2021-09-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946830

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21946830

Country of ref document: EP

Kind code of ref document: A1