CN114970502A - Text error correction method applied to digital government - Google Patents

Text error correction method applied to digital government

Info

Publication number
CN114970502A
Authority
CN
China
Prior art keywords
confusion
error correction
text
result
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111633076.4A
Other languages
Chinese (zh)
Other versions
CN114970502B (en)
Inventor
吴琼
常诚
王元卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Science And Technology Big Data Research Institute
Original Assignee
China Science And Technology Big Data Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Science And Technology Big Data Research Institute
Priority to CN202111633076.4A
Publication of CN114970502A
Application granted
Publication of CN114970502B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G06F40/146 Coding or compression of tree-structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Educational Administration (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of computers and particularly relates to a text error correction method applied to digital government. The method comprises the procedures of model training, data acquisition, data cleaning, text error correction and data storage. Character pronunciation, glyph shape and the character itself are added as features to a pre-trained model for training, which improves error correction accuracy for characters with similar pronunciation or similar glyphs and effectively reduces the workload of supervision and inspection personnel. The model's error correction accuracy is about 70%, and reaches 83% when pronunciation and glyph are added as training features.

Description

Text error correction method applied to digital government
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a text error correction method applied to a digital government.
Background
Website information content is typically checked through manual inspection, system monitoring, public media feedback, and the like. Because government affairs disclosure covers many areas and a wide scope, involves a huge amount of information and has strict timeliness requirements, manual inspection alone cannot meet the demand. Systematic detection is therefore the most important detection method, and the accuracy of its results is particularly important: false detections and missed detections both increase the workload of inspection personnel.
Commonly used methods for detecting wrongly written characters include wrongly-written-character dictionaries, edit distance and language models. For the dictionary-based error correction algorithm, the manual cost of building the dictionary is high, so it is suited only to certain vertical domains in which the set of wrongly written characters is limited. The error correction algorithm based on edit-distance matching uses an approach similar to fuzzy string matching and can correct some common wrongly written characters and faulty wording by comparison with correct samples, but it lacks generality. It is therefore necessary to study a text error correction method and system applied to digital government.
Disclosure of Invention
In view of the defects and problems of existing approaches, the invention provides a text error correction method applied to digital government, which effectively addresses the high labor cost and poor generality of existing error correction models.
The technical scheme adopted by the invention to solve the technical problem is as follows: a text error correction method applied to digital government comprises the following steps:
S1, model training
(1) Obtaining a corpus
Encyclopedia entries, headlines and discordant news are integrated as a resource library to obtain a corpus;
(2) Formulating confusion rules
The confusion rules comprise a glyph (font) confusion rule, a pronunciation confusion rule and a character confusion rule. The glyph confusion rule uses reverse five-stroke (Wubi) coding to decompose each glyph into its roots, and the decomposition result is input into the glyph vector; one or more roots are randomly replaced according to the five-stroke error-prone library to form the corresponding glyph confusion set, where replacing one root is recorded as a coding distance of 1 and replacing N roots as a coding distance of N;
the pronunciation confusion rule obtains the pinyin of a character from a pinyin dictionary, then obtains the initial, the final and the tone of that pinyin, and inputs the result into the pronunciation vector; the following rules are formulated according to the pinyin dictionary:
① same pronunciation and same tone: the edit distance is equal to 0;
② same pronunciation and different tone: the edit distance is equal to 1;
③ flat-tongue/retroflex confusion or front/back nasal confusion: the edit distance is equal to 1;
④ one of the initial or the final changed: the edit distance is equal to 1;
⑤ both the initial and the final changed: the edit distance is greater than 1;
edit distances of different lengths are selected to generate the pronunciation confusion set;
the character confusion rule has an edit distance of 1 and replaces a character with one randomly selected from the character library;
(3) Obtaining a confusion set
Sample characters are randomly extracted from the corpus each time and preprocessed to obtain a sample set; the sample set is replaced in specific proportions according to the glyph confusion rule, the pronunciation confusion rule and the character confusion rule respectively, yielding the corresponding confusion set;
(4) Model training
The confusion set is taken as the input set and the sample set as the reference set, forming one-to-one sentence pairs; the model is trained end to end to finally obtain the error correction model;
S2, data acquisition
A website domain name is received, and the links in all web pages are acquired recursively, with duplicate links removed according to the URL hash; the acquisition process information and the acquisition results are assembled into JSON format as the data acquisition result and pushed to the KAFKA message system;
S3, data cleaning
KAFKA messages are subscribed to, the data acquisition result is consumed, and the following steps are performed:
1. Subscribing to KAFKA messages, consuming the data acquisition result, identifying the content type and filtering out non-HTML types;
2. Web page preprocessing: the characters of the page source code are encoded according to the charset attribute in the acquisition information, the acquisition result is then serialized, and the web page is parsed into a DOM tree;
3. Web page tag extraction: meta tags are extracted from the DOM tree and stored by category;
4. Web page text extraction: whether the meta tag extraction result contains content page tags is checked to determine whether the web page is a content page containing body text; the body is taken from the HTML source code of the content page and all tags in it are removed, including styles, JavaScript scripts and comments, while the original line feeds are kept; the text content of the web page is extracted after this noise removal;
5. Output of the processing result: the tag extraction result and the text extraction result are assembled into JSON as the data cleaning result, which is appended to the data acquisition result and pushed to the KAFKA message system;
S4, text error correction
KAFKA messages are subscribed to, the data cleaning result is consumed, and the following steps are performed:
1. The web page text is split into sentences by punctuation marks and paragraphs, and each sentence is input into the error correction model to judge whether it contains errors; if so, the corrected result is substituted into the sentence, which is input into the error correction model again for recursive error correction; the recursion stops when two successive error correction results are the same, giving the correction result of the error correction model;
2. Each sentence containing wrongly written characters, together with its error correction result, is assembled into a sentence JSON, and multiple sentence JSONs form a sentence array;
3. For each sentence in the sentence array, the position of each wrongly written character, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic are assembled into a JSON, and the array formed by these JSONs is added to the corresponding sentence JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. Output of the processing result: the text error correction result is added to the data cleaning result JSON and pushed to the KAFKA message system;
S5, data storage
KAFKA messages are subscribed to, and the data acquisition result, the data cleaning result and the text error correction result are stored in an Elasticsearch storage system, with the URL hash as the primary key.
Further, in the process of obtaining the confusion set in S1, 15% of the characters are randomly extracted from the corpus each time; of these, 60% undergo pronunciation confusion, 20% undergo glyph confusion and 20% undergo character confusion.
Further, in S2, the acquisition process information includes the URL, IP, protocol, proxy, request method, request time, acquisition status and server; the acquisition result includes the request header, the response header and the corresponding content.
Further, in S3, the rules for classified storage are:
① Website tags: SiteName, SiteDomain, SiteIDcode, ColumName
② Column tags: ColumnDescription, ColumnKeywords, ColumnType
③ Content page tags: ArticleTitle, PubDate, ContentSource, Keywords, Author, Description, Image, Url.
Further, in S3, after noise removal the text content of the web page is extracted using a text extraction algorithm based on the text density and the symbol density of the web page.
The beneficial effects of the invention are as follows: the invention discloses a text error correction method and system for publicly disclosed government affairs information.
To address the problems described above, the invention adds character pronunciation, glyph shape and the character itself as features to a pre-trained model for training, which improves error correction accuracy for characters with similar pronunciation or similar glyphs and effectively reduces the workload of supervision and inspection personnel. The model's error correction accuracy is about 70%, and reaches 83% when the error correction model is trained with pronunciation and glyph added as features.
Meanwhile, corresponding weights are assigned to pronunciation, glyph and character during error correction. In particular, because most current input habits rely on pinyin input, the weight of pronunciation exceeds one half, so that wrongly written characters can be judged accurately and accuracy is improved.
Drawings
FIG. 1 is a schematic diagram of an error correction process according to the present invention.
FIG. 2 is a schematic diagram of fused error correction using character, pronunciation and glyph.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
Example 1: This embodiment provides a text error correction method applied to digital government, mainly used for correcting the text of digital web pages. To address the problems of existing error correction models, the method mainly considers pronunciation, characters and glyphs during error correction.
In implementation, this embodiment comprises the following steps:
S1, model training
(1) Obtaining a corpus
Encyclopedia entries, headlines, discordant news and the like are integrated as resource libraries to obtain a corpus;
(2) Formulating confusion rules
The confusion rules comprise a glyph (font) confusion rule, a pronunciation confusion rule and a character confusion rule. The glyph confusion rule uses reverse five-stroke (Wubi) coding to decompose each glyph into its roots, and the decomposition result is input into the glyph vector. One or more roots are randomly replaced according to the five-stroke error-prone library to form the corresponding glyph confusion set, where replacing one root is recorded as a coding distance of 1 and replacing N roots as a coding distance of N. Five-stroke coding splits a character into several independent parts, which effectively reduces dimensionality compared with stroke coding and significantly improves the effect and performance of the model.
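The glyph rule can be illustrated with a minimal Python sketch. The component decompositions and the error-prone table below are small illustrative assumptions, not real five-stroke codes or the error-prone library used by the method; only the distance definition (number of replaced roots) is taken from the text, and the function names are hypothetical.

```python
from itertools import combinations, product

# Illustrative component decompositions (NOT real five-stroke codes).
DECOMPOSITION = {
    "晴": ("日", "青"),
    "睛": ("目", "青"),
    "清": ("氵", "青"),
    "情": ("忄", "青"),
    "请": ("讠", "青"),
}
# Hypothetical error-prone table: roots that are easily mistaken for others.
ERROR_PRONE = {
    "日": ["目"],
    "目": ["日"],
    "氵": ["忄", "讠"],
    "忄": ["氵"],
    "讠": ["氵"],
}
# Reverse lookup from a root sequence back to a character.
COMPOSITION = {roots: char for char, roots in DECOMPOSITION.items()}


def coding_distance(a, b):
    """Number of root positions that differ (None if the lengths differ)."""
    ra, rb = DECOMPOSITION[a], DECOMPOSITION[b]
    if len(ra) != len(rb):
        return None
    return sum(x != y for x, y in zip(ra, rb))


def glyph_confusion_set(char, n_replace=1):
    """Characters obtained by replacing exactly n_replace roots of char."""
    roots = DECOMPOSITION[char]
    candidates = set()
    for positions in combinations(range(len(roots)), n_replace):
        options = [ERROR_PRONE.get(roots[i], []) for i in positions]
        for replacement in product(*options):
            new_roots = list(roots)
            for i, root in zip(positions, replacement):
                new_roots[i] = root
            other = COMPOSITION.get(tuple(new_roots))
            if other is not None and other != char:
                candidates.add(other)
    return candidates
```

The sketch enumerates every candidate at a given distance; a random replacement as described in the text would then pick one element of the returned set, for instance with `random.choice(sorted(glyph_confusion_set("清")))`.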
The pronunciation confusion rule obtains the pinyin of a character from a pinyin dictionary, then obtains the initial, the final and the tone of that pinyin, and inputs the result into the pronunciation vector. The following rules are formulated according to the pinyin dictionary, illustrated with the phrase sān xīn èr yì:
① same pronunciation and same tone: the edit distance is equal to 0, recorded as 0000; for example, sān itself;
② same pronunciation and different tone: the edit distance is equal to 1, recorded as 0100; for example, "powder" (san with a different tone);
③ flat-tongue/retroflex confusion or front/back nasal confusion: the edit distance is equal to 1, recorded as 001; for example, "good" (shan);
④ one of the initial or the final changed: the edit distance is equal to 1, recorded as 0; for example, "Sen" (sen);
⑤ both the initial and the final changed: the edit distance is greater than 1, recorded as 1; for example, "hurt" (shāng);
edit distances of different lengths are selected to generate the pronunciation confusion set.
The character confusion rule has an edit distance of 1 and replaces a character with one randomly selected from the character library.
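The five pronunciation rules can be sketched as a small classifier over (initial, final, tone) triples. This is an illustration under stated assumptions: the fuzzy-pair tables are a common subset chosen for the example, tone is compared only when initial and final both match, the value 2 stands in for "greater than 1", and the function name `phonetic_rule` is hypothetical.

```python
# Assumed fuzzy pairs: flat-tongue/retroflex initials and front/back nasal finals.
FLAT_RETROFLEX = {frozenset(p) for p in (("z", "zh"), ("c", "ch"), ("s", "sh"))}
FRONT_BACK_NASAL = {frozenset(p) for p in (("an", "ang"), ("en", "eng"), ("in", "ing"))}


def phonetic_rule(a, b):
    """Return (rule_number, edit_distance) for two (initial, final, tone) triples.

    An edit distance of 2 is used here to represent "greater than 1".
    """
    (ia, fa, ta), (ib, fb, tb) = a, b
    if ia == ib and fa == fb:
        # Rules 1 and 2: same pronunciation, tone equal or different.
        return (1, 0) if ta == tb else (2, 1)
    initial_changed = ia != ib
    final_changed = fa != fb
    if initial_changed and final_changed:
        # Rule 5: both initial and final differ.
        return (5, 2)
    # Exactly one of initial/final differs from here on.
    if initial_changed and frozenset((ia, ib)) in FLAT_RETROFLEX:
        return (3, 1)
    if final_changed and frozenset((fa, fb)) in FRONT_BACK_NASAL:
        return (3, 1)
    return (4, 1)


san1 = ("s", "an", 1)
print(phonetic_rule(san1, ("s", "an", 4)))     # different tone only
print(phonetic_rule(san1, ("sh", "an", 4)))    # s / sh confusion
print(phonetic_rule(san1, ("s", "en", 1)))     # final changed
print(phonetic_rule(san1, ("sh", "ang", 1)))   # initial and final changed
```

Grouping candidate characters by the returned distance is one way to build confusion sets at different edit distances, as the last rule line describes.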
(3) Obtaining a confusion set
Sample characters are randomly extracted from the corpus each time and preprocessed to obtain a sample set; the sample set is replaced in specific proportions according to the glyph confusion rule, the pronunciation confusion rule and the character confusion rule respectively, yielding the corresponding confusion set;
preferably, in this embodiment, 15% of the characters are randomly extracted from the corpus each time; of these, 60% undergo pronunciation confusion, 20% undergo glyph confusion and 20% undergo character confusion.
(4) Model training
The confusion set is taken as the input set and the sample set as the reference set, forming one-to-one sentence pairs; each sentence pair consists of a correct sentence and an erroneous sentence containing wrong characters introduced by confusion. The model is trained end to end to finally obtain the error correction model.
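A minimal sketch of how such sentence pairs might be generated is given below. The confusion tables are toy, hypothetical data, the 15% rate is applied per sentence (an assumption; the text speaks of the corpus), and the 60/20/20 split across pronunciation, glyph and random-character replacement is this sketch's reading of the stated proportions. No model training is shown.

```python
import random

# Hypothetical toy confusion tables; real ones would come from the rules above.
PHONETIC = {"三": ["散", "伞"], "心": ["新", "信"]}
GLYPH = {"晴": ["睛"], "人": ["入"]}
CHARSET = list("的一是在不了有和人这")


def corrupt(sentence, rng, rate=0.15, weights=(0.6, 0.2, 0.2)):
    """Replace about `rate` of the characters using the three confusion rules."""
    chars = list(sentence)
    if not chars:
        return sentence
    n = max(1, round(len(chars) * rate))
    for pos in rng.sample(range(len(chars)), n):
        rule = rng.choices(("phonetic", "glyph", "char"), weights=weights)[0]
        original = chars[pos]
        if rule == "phonetic" and original in PHONETIC:
            chars[pos] = rng.choice(PHONETIC[original])
        elif rule == "glyph" and original in GLYPH:
            chars[pos] = rng.choice(GLYPH[original])
        else:
            # Character rule; also the fallback when no table entry exists.
            chars[pos] = rng.choice([c for c in CHARSET if c != original])
    return "".join(chars)


def make_pairs(sentences, seed=0):
    """Return (erroneous sentence, correct sentence) training pairs."""
    rng = random.Random(seed)
    return [(corrupt(s, rng), s) for s in sentences]


for wrong, right in make_pairs(["三心二意的人在这里"]):
    print(wrong, "->", right)
```

Because every replacement keeps the sentence length unchanged, the erroneous and correct sentences stay aligned character by character, which suits a sequence-to-sequence or tagging-style corrector.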
S2, data acquisition
The URL of the website to be checked is imported into the website data acquisition system, with the acquisition scope restricted to the website's domain name. The links in all web pages are acquired recursively (duplicate links are removed according to the URL hash). The acquisition process information (URL, IP, protocol, proxy, request method, request time, acquisition status, server and the like) and the acquisition results (request header, response header and corresponding content) are assembled into JSON format as the data acquisition result and pushed to the KAFKA message system.
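The de-duplicated, domain-restricted traversal can be sketched as follows. Several choices here are assumptions rather than details from the text: SHA-256 as the hash function, the JSON field names, an explicit stack instead of recursion, and a caller-supplied `fetch_links` function standing in for the real HTTP fetch and link extraction. Pushing to the message system is omitted.

```python
import hashlib
import json
from urllib.parse import urljoin, urlparse


def url_hash(url):
    """Stable key for a URL (SHA-256 is an assumption; the text only says URL hash)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def crawl(start_url, fetch_links):
    """Visit every in-domain page reachable from start_url exactly once.

    fetch_links is a hypothetical callable: url -> list of href strings.
    Returns one JSON string per visited page.
    """
    domain = urlparse(start_url).netloc
    seen, stack, records = set(), [start_url], []
    while stack:
        url = stack.pop()
        key = url_hash(url)
        # Skip duplicates and anything outside the website's domain.
        if key in seen or urlparse(url).netloc != domain:
            continue
        seen.add(key)
        links = [urljoin(url, href) for href in fetch_links(url)]
        records.append(json.dumps({"url_hash": key, "url": url, "links": links}))
        stack.extend(links)
    return records


# Toy in-memory site used in place of real HTTP requests.
SITE = {
    "https://example.org/": ["/a", "/b", "https://other.example.com/"],
    "https://example.org/a": ["/", "/b"],
    "https://example.org/b": [],
}
for record in crawl("https://example.org/", lambda u: SITE.get(u, [])):
    print(record)
```

In a real acquisition system each record would also carry the process information and the request and response data listed above.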
S3, data cleaning
KAFKA messages are subscribed to, the data acquisition result is consumed, and the following steps are performed:
1. Subscribing to KAFKA messages, consuming the data acquisition result, identifying the content type and filtering out non-HTML types.
2. Web page preprocessing: the characters of the page source code are encoded according to the charset attribute in the acquisition information to prevent garbled text, the acquisition result is then serialized, and the web page is parsed into a DOM tree.
3. Web page tag extraction: meta tags are extracted from the DOM tree and stored by category:
① Website tags: SiteName, SiteDomain, SiteIDcode, ColumName
② Column tags: ColumnDescription, ColumnKeywords, ColumnType
③ Content page tags: ArticleTitle, PubDate, ContentSource, Keywords, Author, Description, Image, Url.
4. Web page text extraction: whether the meta tag extraction result contains content page tags is checked to determine whether the web page is a content page containing body text; the body is taken from the HTML source code of the content page and all tags in it are removed, including styles, JavaScript scripts, comments and the like, while the original line feeds are kept; after this noise removal, the text content of the web page is extracted using a text extraction algorithm based on the text density and the symbol density of the web page.
5. Output of the processing result: the tag extraction result and the text extraction result are assembled into JSON as the data cleaning result, which is appended to the data acquisition result and pushed to the KAFKA message system.
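Steps 3 and 4 can be sketched with Python's standard `html.parser`. This is a simplified illustration: it groups only a subset of the tag names listed above, treats a page as a content page when any content page tag is present, and strips tags, scripts, styles and comments from the body while keeping line breaks between non-empty lines. The density-based extraction algorithm mentioned in step 4 is not implemented, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Subset of the meta tag names listed above, grouped by category.
GROUPS = {
    "site": {"SiteName", "SiteDomain"},
    "column": {"ColumnDescription", "ColumnKeywords", "ColumnType"},
    "content": {"ArticleTitle", "PubDate", "ContentSource", "Keywords",
                "Author", "Description", "Image", "Url"},
}


class PageCleaner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {group: {} for group in GROUPS}
        self._in_body = False
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name, value = attrs.get("name"), attrs.get("content")
            for group, names in GROUPS.items():
                if name in names and value is not None:
                    self.meta[group][name] = value
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; HTMLParser routes them to handle_comment.
        if self._in_body and not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        lines = (line.strip() for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def clean(html):
    """Return grouped meta tags, a content-page flag and the body text."""
    parser = PageCleaner()
    parser.feed(html)
    parser.close()
    is_content_page = bool(parser.meta["content"])
    return {
        "meta": parser.meta,
        "is_content_page": is_content_page,
        "text": parser.text() if is_content_page else "",
    }
```

The returned dictionary can be serialized with `json.dumps` to form the data cleaning result described in step 5.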
S4, text error correction
KAFKA messages are subscribed to, the data cleaning result is consumed, and the following steps are performed:
1. The web page text is split into sentences by punctuation marks and paragraphs, and each sentence is first input into the error correction model to judge whether it contains errors; if so, the corrected result is substituted into the sentence, which is input into the model again for recursive error correction; the recursion stops when two successive error correction results are the same, giving the model correction result. The result includes the error probabilities for character, pronunciation and glyph (i.e., the first, second and third probabilities) and the corresponding corrected characters (i.e., the first, second and third characteristics).
2. Each sentence containing wrongly written characters, together with its error correction result, is assembled into a sentence JSON, and multiple sentence JSONs form a sentence array;
3. For each sentence in the sentence array, the position of each wrongly written character, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic are assembled into a JSON, and the array formed by these JSONs is added to the corresponding sentence JSON.
The first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum.
4. Output of the processing result: the text error correction result is added to the data cleaning result JSON and pushed to the KAFKA message system.
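The control flow of this stage (sentence splitting, repeated correction until the output stops changing, and taking the maximum of the three probabilities) can be sketched as below. The trained model is replaced by a hypothetical lookup table `TOY_MODEL`, the JSON field names are illustrative, and `max_rounds` is an added safeguard that the text does not mention.

```python
import re

# Hypothetical stand-in for the trained model: for a wrong character, each
# channel proposes (corrected character, probability).
TOY_MODEL = {
    "散": {"character": ("三", 0.10), "pronunciation": ("三", 0.85), "glyph": ("撒", 0.05)},
    "睛": {"character": ("晴", 0.20), "pronunciation": ("经", 0.10), "glyph": ("晴", 0.70)},
}


def split_sentences(text):
    """Split on line breaks and on Chinese/Western sentence-ending punctuation."""
    parts = re.findall(r"[^。！？!?；;\n]+[。！？!?；;]?", text)
    return [p.strip() for p in parts if p.strip()]


def correct_once(sentence):
    """One pass of the (toy) model: returns corrected sentence and error details."""
    out, details = [], []
    for pos, ch in enumerate(sentence):
        channels = TOY_MODEL.get(ch)
        if channels is None:
            out.append(ch)
            continue
        # Fusion: keep the channel with the highest probability.
        best = max(channels, key=lambda name: channels[name][1])
        fused_char, fused_prob = channels[best]
        details.append({
            "position": pos,
            "wrong": ch,
            "probabilities": {name: p for name, (_, p) in channels.items()},
            "features": {name: c for name, (c, _) in channels.items()},
            "fusion_channel": best,
            "fusion_probability": fused_prob,
            "fusion_feature": fused_char,
        })
        out.append(fused_char)
    return "".join(out), details


def correct_until_stable(sentence, max_rounds=10):
    """Feed the corrected sentence back in until two successive results match."""
    current, all_details = sentence, []
    for _ in range(max_rounds):
        corrected, details = correct_once(current)
        all_details.extend(details)
        if corrected == current:
            break
        current = corrected
    return current, all_details


for s in split_sentences("今天天气睛。散心二意！"):
    print(correct_until_stable(s)[0])
```

Each entry in the returned details list corresponds to one per-character JSON of step 3, and the list as a whole to the array attached to the sentence JSON.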
S5, data storage
KAFKA messages are subscribed to, and the data acquisition result, the data cleaning result and the text error correction result are stored in an Elasticsearch storage system, with the URL hash as the primary key.

Claims (5)

1. A text error correction method applied to digital government, characterized in that the method comprises the following steps:
S1, model training
(1) Obtaining a corpus
Encyclopedia entries, headlines and discordant news are integrated as a resource library to obtain a corpus;
(2) Formulating confusion rules
The confusion rules comprise a glyph (font) confusion rule, a pronunciation confusion rule and a character confusion rule; the glyph confusion rule uses reverse five-stroke (Wubi) coding to decompose each glyph into its roots, and the decomposition result is input into the glyph vector; one or more roots are randomly replaced according to the five-stroke error-prone library to form the corresponding glyph confusion set, where replacing one root is recorded as a coding distance of 1 and replacing N roots as a coding distance of N;
the pronunciation confusion rule obtains the pinyin of a character from a pinyin dictionary, then obtains the initial, the final and the tone of that pinyin, and inputs the result into the pronunciation vector; the following rules are formulated according to the pinyin dictionary:
① same pronunciation and same tone: the edit distance is equal to 0;
② same pronunciation and different tone: the edit distance is equal to 1;
③ flat-tongue/retroflex confusion or front/back nasal confusion: the edit distance is equal to 1;
④ one of the initial or the final changed: the edit distance is equal to 1;
⑤ both the initial and the final changed: the edit distance is greater than 1;
edit distances of different lengths are selected to generate the pronunciation confusion set;
the character confusion rule has an edit distance of 1 and replaces a character with one randomly selected from the character library;
(3) Obtaining a confusion set
Sample characters are randomly extracted from the corpus each time and preprocessed to obtain a sample set; the sample set is replaced in specific proportions according to the glyph confusion rule, the pronunciation confusion rule and the character confusion rule respectively, yielding the corresponding confusion set;
(4) Model training
The confusion set is taken as the input set and the sample set as the reference set, forming one-to-one sentence pairs; the model is trained end to end to finally obtain the error correction model;
S2, data acquisition
A website domain name is received, and the links in all web pages are acquired recursively, with duplicate links removed according to the URL hash; the acquisition process information and the acquisition results are assembled into JSON format as the data acquisition result and pushed to the KAFKA message system;
S3, data cleaning
KAFKA messages are subscribed to, the data acquisition result is consumed, and the following steps are performed:
1. Subscribing to KAFKA messages, consuming the data acquisition result, identifying the content type and filtering out non-HTML types;
2. Web page preprocessing: the characters of the page source code are encoded according to the charset attribute in the acquisition information, the acquisition result is then serialized, and the web page is parsed into a DOM tree;
3. Web page tag extraction: meta tags are extracted from the DOM tree and stored by category;
4. Web page text extraction: whether the meta tag extraction result contains content page tags is checked to determine whether the web page is a content page containing body text; the body is taken from the HTML source code of the content page and all tags in it are removed, including styles, JavaScript scripts and comments, while the original line feeds are kept; the text content of the web page is extracted after this noise removal;
5. Output of the processing result: the tag extraction result and the text extraction result are assembled into JSON as the data cleaning result, which is appended to the data acquisition result and pushed to the KAFKA message system;
S4, text error correction
KAFKA messages are subscribed to, the data cleaning result is consumed, and the following steps are performed:
1. The web page text is split into sentences by punctuation marks and paragraphs, and each sentence is input into the error correction model to judge whether it contains errors; if so, the corrected result is substituted into the sentence, which is input into the error correction model again for recursive error correction; the recursion stops when two successive error correction results are the same, giving the correction result of the error correction model;
2. Each sentence containing wrongly written characters, together with its error correction result, is assembled into a sentence JSON, and multiple sentence JSONs form a sentence array;
3. For each sentence in the sentence array, the position of each wrongly written character, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic are assembled into a JSON, and the array formed by these JSONs is added to the corresponding sentence JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. Output of the processing result: the text error correction result is added to the data cleaning result JSON and pushed to the KAFKA message system;
S5, data storage
KAFKA messages are subscribed to, and the data acquisition result, the data cleaning result and the text error correction result are stored in an Elasticsearch storage system, with the URL hash as the primary key.
2. The text error correction method applied to the digital government according to claim 1, wherein: in the process of obtaining the confusion set in S1, 15% of the characters are randomly extracted from the corpus each time, and 60% of the characters are subjected to word-sound confusion, 20% of the characters are subjected to font confusion, and 20% of the characters are subjected to font confusion.
3. The text error correction method applied to the digital government according to claim 1, wherein: in S2, the acquisition process information includes URL, IP, protocol, proxy, request mode, request time, acquisition status, and server; the acquisition results include a request header, a response header, and corresponding content.
4. The text error correction method applied to the digital government according to claim 1, wherein: in S3, the rules for classified storage are:
Website tags: SiteName, SiteDomain, SiteIDcode, ColumName
Column tags: ColumnDescription, ColumnKeywords, ColumnType
Content page tags: ArticleTitle, PubDate, ContentSource, Keywords, Author, Description, Image, Url.
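The classification rule in claim 4 can be sketched as follows, assuming the listed tags appear as HTML `<meta name="..." content="...">` elements in the page head. That assumption, the group keys (`website`, `column`, `content_page`) and the names `MetaCollector` and `classify_meta` are illustrative and not taken from the claim. Tag names are matched exactly as spelled in the claim.

```python
from html.parser import HTMLParser

# Tag names exactly as listed in the claim; the group keys are hypothetical.
TAG_GROUPS = {
    "website": ["SiteName", "SiteDomain", "SiteIDcode", "ColumName"],
    "column": ["ColumnDescription", "ColumnKeywords", "ColumnType"],
    "content_page": ["ArticleTitle", "PubDate", "ContentSource", "Keywords",
                     "Author", "Description", "Image", "Url"],
}


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def classify_meta(html: str) -> dict:
    """Group the collected meta values into the three categories."""
    collector = MetaCollector()
    collector.feed(html)
    return {
        group: {n: collector.meta[n] for n in names if n in collector.meta}
        for group, names in TAG_GROUPS.items()
    }


page = ('<html><head><meta name="SiteName" content="Example Gov">'
        '<meta name="ColumnType" content="News">'
        '<meta name="ArticleTitle" content="Notice"/></head></html>')
result = classify_meta(page)
```

Tags that are absent from a page are simply omitted from their group, so each stored record carries only the values actually present.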
5. The text error correction method applied to the digital government according to claim 1, wherein: in S3, after noise reduction, the text content of the web page is extracted using a text extraction algorithm based on the text density and the symbol density of the web page.
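The idea behind density-based extraction can be sketched with a simplified heuristic. This is not the scoring formula of the patent, which is not given in the claim. The sketch assumes the de-noised page has already been split into candidate blocks, each with its visible text and the number of HTML tags it contains; `Block`, `score` and the punctuation set are hypothetical choices.

```python
from dataclasses import dataclass

# Sentence punctuation counted as "symbols" (an illustrative choice).
PUNCTUATION = set("，。！？；：、,.!?;:")


@dataclass
class Block:
    text: str       # visible text of one candidate node
    tag_count: int  # number of HTML tags inside that node


def text_density(block: Block) -> float:
    """Characters of text per tag; body text is long relative to its markup."""
    return len(block.text) / max(block.tag_count, 1)


def symbol_density(block: Block) -> float:
    """Punctuation marks per character; menus and link lists have almost none."""
    if not block.text:
        return 0.0
    return sum(ch in PUNCTUATION for ch in block.text) / len(block.text)


def score(block: Block) -> float:
    """Combine both densities; the product form is a simplifying assumption."""
    return text_density(block) * (1.0 + symbol_density(block))


def pick_main_block(blocks):
    """Return the candidate block most likely to hold the body text."""
    return max(blocks, key=score)


blocks = [
    Block("Home News About Contact", 8),
    Block("The city government issued a notice today. "
          "Residents may apply online, by mail, or in person.", 2),
]
main = pick_main_block(blocks)
```

In this example the navigation block scores low because it is short, heavy in tags and free of punctuation, while the paragraph block is selected as the body text.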
CN202111633076.4A 2021-12-29 2021-12-29 Text error correction method applied to digital government Active CN114970502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111633076.4A CN114970502B (en) 2021-12-29 2021-12-29 Text error correction method applied to digital government

Publications (2)

Publication Number Publication Date
CN114970502A true CN114970502A (en) 2022-08-30
CN114970502B CN114970502B (en) 2023-03-28

Family

ID=82974441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111633076.4A Active CN114970502B (en) 2021-12-29 2021-12-29 Text error correction method applied to digital government

Country Status (1)

Country Link
CN (1) CN114970502B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11328317A (en) * 1998-05-11 1999-11-30 Nippon Telegr & Teleph Corp <Ntt> Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded
CN1687877A (en) * 2005-04-14 2005-10-26 刘伊翰 Chinese character input method capable of using English
CN104916169A (en) * 2015-05-20 2015-09-16 江苏理工学院 Card type German learning tool
CN110489760A (en) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 Based on deep neural network text auto-collation and device
CN110765740A (en) * 2019-10-11 2020-02-07 深圳市比一比网络科技有限公司 DOM tree-based full-type text replacement method, system, device and storage medium
CN112199945A (en) * 2020-08-19 2021-01-08 宿迁硅基智能科技有限公司 Text error correction method and device
CN112016310A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium
WO2021189851A1 (en) * 2020-09-03 2021-09-30 平安科技(深圳)有限公司 Text error correction method, system and device, and readable storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件***有限公司 Text error correction method, system, computer device and readable storage medium
CN113361266A (en) * 2021-06-25 2021-09-07 达闼机器人有限公司 Text error correction method, electronic device and storage medium
CN113642316A (en) * 2021-07-28 2021-11-12 平安国际智慧城市科技股份有限公司 Chinese text error correction method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Junjie Yu et al.: "Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape" *
Li Jianyi et al.: "A data augmentation method for Chinese spelling error correction" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438650A (en) * 2022-11-08 2022-12-06 深圳擎盾信息科技有限公司 Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN117236319A (en) * 2023-09-25 2023-12-15 中国—东盟信息港股份有限公司 Real scene Chinese text error correction method based on Transformer generation model
CN117236319B (en) * 2023-09-25 2024-04-19 中国—东盟信息港股份有限公司 Real scene Chinese text error correction method based on Transformer generation model

Also Published As

Publication number Publication date
CN114970502B (en) 2023-03-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared