CN114970502A - Text error correction method applied to digital government

Info

Publication number: CN114970502A
Application number: CN202111633076.4A
Authority: CN
Inventors: 吴琼; 常诚; 王元卓
Original assignee: China Science And Technology Big Data Research Institute
Current assignee: China Science And Technology Big Data Research Institute
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-08-30
Anticipated expiration: 2041-12-29
Also published as: CN114970502B

Abstract

The invention belongs to the technical field of computers and relates in particular to a text error correction method applied to digital government. The method comprises a workflow of model training, data acquisition, data cleaning, text error correction and data storage, in which character pronunciation, glyph shape and the character itself are added as features to a pre-trained model for training. This improves the correction accuracy for characters with similar pronunciation or similar shape and effectively reduces the workload of supervision and inspection personnel: the error correction accuracy of the model is about 70%, and it reaches 83% when pronunciation and glyph shape are added as training features.

Description

Text error correction method applied to digital government

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a text error correction method applied to a digital government.

Background

Website content is typically checked by manual inspection, system monitoring, public media feedback, and the like. Because government information disclosure covers many areas, involves a huge volume of information and has strict timeliness requirements, manual inspection alone cannot meet the need. Systematic detection is therefore the most important detection method, and the accuracy of its results is particularly important: both false detections and missed detections increase the workload of inspection personnel.

Commonly used methods for detecting wrongly written characters include wrongly-written-character dictionaries, edit distance and language models. An error correction algorithm based on a dictionary of wrongly written characters has a high manual cost for building the dictionary, so it is suitable only for certain vertical domains with a limited set of wrongly written characters. An error correction algorithm based on edit distance matching uses a method similar to fuzzy string matching and can correct some common wrongly written characters and faulty wording by comparison with correct samples, but it lacks generality. It is therefore necessary to study a text error correction method and system applied to digital government.

Disclosure of Invention

In view of the shortcomings and problems of existing methods, the invention provides a text error correction method applied to digital government, which effectively addresses the high labor cost and poor generality of existing error correction models.

The technical scheme adopted by the invention for solving the technical problems is as follows: a text error correction method applied to digital governments comprises the following steps:

S1, model training

(1) Obtaining a corpus

Integrating encyclopedia, headline and discordant news sources as a resource library to obtain a corpus;

(2) Formulating confusion decision rules

The confusion rules comprise glyph confusion rules, pronunciation confusion rules and character confusion rules. The glyph confusion rule uses reverse five-stroke (Wubi) coding to decompose each glyph into its five-stroke roots, and the decomposition result is input into the glyph vector; one or more roots are randomly replaced according to the five-stroke error-prone library to form the corresponding glyph confusion set, where replacing one root gives a coding distance of 1 and replacing N roots gives a coding distance of N;

The pronunciation confusion rule obtains the pinyin of a character from a pinyin dictionary, further obtains the initial, the final and the tone of that pinyin, and inputs the result into a pronunciation vector; the following rules are formulated according to the pinyin dictionary:

① the pronunciation is the same and the tone is the same: the edit distance equals 0;

② the pronunciation is the same and the tone is different: the edit distance equals 1;

③ flat-tongue and retroflex initials are confused, or front and back nasal finals are confused: the edit distance equals 1;

④ one of the initial or the final is changed: the edit distance equals 1;

⑤ both the initial and the final are changed: the edit distance is greater than 1;

Edit distances of different sizes are selected to generate the pronunciation confusion set;

The edit distance of the character confusion rule is 1, and the rule is to replace a character with one selected at random from the character library;
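The five pronunciation rules above can be sketched as a small distance function. This is only an illustrative reading of the rules, not the patented implementation: the `Syllable` type, the lists of flat/retroflex and front/back nasal pairs, the choice to ignore tone once the initial or final differs, and the value 2 standing in for "greater than 1" are all assumptions.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    initial: str  # pinyin initial, e.g. "s" or "sh"
    final: str    # pinyin final, e.g. "an" or "ang"
    tone: int     # tone number 1-4


# Assumed pair lists; the text does not enumerate them.
FLAT_RETROFLEX = {frozenset(p) for p in [("z", "zh"), ("c", "ch"), ("s", "sh")]}
FRONT_BACK_NASAL = {frozenset(p) for p in [("an", "ang"), ("en", "eng"), ("in", "ing")]}


def pronunciation_distance(a: Syllable, b: Syllable) -> int:
    """Edit distance between two syllables following rules 1 to 5."""
    same_initial = a.initial == b.initial
    same_final = a.final == b.final
    if same_initial and same_final:
        # Rule 1 (same tone) or rule 2 (different tone).
        return 0 if a.tone == b.tone else 1
    # Rule 3: flat/retroflex initials or front/back nasal finals confused.
    if same_final and frozenset((a.initial, b.initial)) in FLAT_RETROFLEX:
        return 1
    if same_initial and frozenset((a.final, b.final)) in FRONT_BACK_NASAL:
        return 1
    # Rule 4: exactly one of initial or final changed.
    if same_initial or same_final:
        return 1
    # Rule 5: both changed; 2 is a placeholder for "greater than 1".
    return 2
```

Rules ③ and ④ both yield a distance of 1, so the separate checks for rule ③ only document which rule fired; a confusion-set generator could use that distinction to weight candidates differently.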

(3) Obtaining a confusion set

Each time, sample characters are randomly extracted from the corpus and preprocessed to obtain a sample set; the sample set is then replaced, in specified proportions, according to the glyph confusion rule, the pronunciation confusion rule and the character confusion rule respectively, yielding the corresponding confusion set;

(4) Model training

The confusion set is taken as the input set and the sample set as the comparison set, forming one-to-one sentence pairs; the model is trained end to end to finally obtain the error correction model;

S2, data acquisition

A website domain name is received and the links in all webpages are acquired recursively, with repeated links removed according to URL HASH; the acquisition process information and the acquisition results are assembled into JSON as the data acquisition result, which is pushed to the KAFKA message system;

S3, data cleaning

Subscribing to KAFKA messages, consuming the data acquisition result, and performing the following steps:

1. Subscribing to KAFKA messages, consuming the data acquisition result, identifying the content type, and filtering out non-HTML types;

2. Webpage preprocessing: handling the character encoding of the page source code according to the charset attribute in the acquisition information, then serializing the acquisition result and parsing the webpage into a DOM tree;

3. Extracting the webpage tags: extracting the meta tags from the DOM tree and storing them by category;

4. Extracting the webpage text: judging whether the meta tag extraction result contains content page tags, thereby determining whether the webpage is a content page containing body text; taking the body out of the HTML source code of the content page, removing all tags in the body, including styles, JavaScript scripts and comments, while preserving the original line breaks, and extracting the text content of the webpage after this noise reduction;

5. Outputting the processing result: forming JSON from the tag extraction result and the text extraction result as the data cleaning result, adding it to the data acquisition result, and pushing it to the KAFKA message system;
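The noise-reduction part of step 4 can be illustrated with Python's standard `html.parser`. This is a minimal sketch under stated assumptions: it only keeps text inside `body`, drops `script` and `style` content and comments, and preserves line breaks; it does not implement the density-based text extraction, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text inside <body>, skipping scripts, styles and comments."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if self.in_body and not self.skip_depth:
            self.parts.append(data)


def extract_body_text(html: str) -> str:
    """Return the body text with tags removed and line breaks preserved."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

A production cleaner would more likely operate on a full DOM tree, as the method describes; the streaming parser is used here only to keep the example dependency-free.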

S4, text error correction

Subscribing to KAFKA messages, consuming the data cleaning result, and performing the following steps:

1. Cutting the webpage text into sentences according to punctuation marks and paragraphs and inputting each sentence into the error correction model to judge whether it contains errors; if so, the corrected result is substituted into the sentence and input into the error correction model again, so that error correction is performed recursively; when two successive error correction results are the same, the recursion stops and the correction result of the error correction model is obtained;

2. Forming each sentence containing wrongly written characters, together with its error correction result, into a sentence JSON, a plurality of which form a sentence array;

3. For each sentence in the sentence array, forming a JSON from the position of each wrongly written character and its first probability, second probability, third probability, fusion probability, first characteristic, second characteristic, third characteristic and fusion characteristic, and adding the array formed by these JSON objects to the corresponding sentence JSON;

The first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;

4. Outputting the processing result: adding the text error correction result to the data cleaning result JSON and pushing it to the KAFKA message system;
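The per-character record of step 3 and the maximum-based fusion can be sketched as follows. The JSON field names and the `(probability, characteristic)` tuple layout are hypothetical; the text fixes only which quantities are stored and that the fusion probability is the maximum of the three.

```python
import json


def fuse_error(position, char_pred, pron_pred, glyph_pred):
    """Build one error record; each *_pred is (probability, characteristic)."""
    fusion_prob, fusion_char = max(
        (char_pred, pron_pred, glyph_pred), key=lambda pred: pred[0]
    )
    return {
        "position": position,
        "first_probability": char_pred[0],    # character error
        "first_feature": char_pred[1],
        "second_probability": pron_pred[0],   # pronunciation error
        "second_feature": pron_pred[1],
        "third_probability": glyph_pred[0],   # glyph error
        "third_feature": glyph_pred[1],
        "fusion_probability": fusion_prob,
        "fusion_feature": fusion_char,
    }


def sentence_json(sentence, corrected, errors):
    """Serialize a sentence, its correction and its error records."""
    record = {"sentence": sentence, "corrected": corrected, "errors": errors}
    return json.dumps(record, ensure_ascii=False)
```

For instance, `fuse_error(3, (0.2, "a"), (0.7, "b"), (0.4, "c"))` selects the pronunciation branch, since 0.7 is the largest of the three probabilities.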

S5, data storage

Subscribing to KAFKA messages and storing the data acquisition result, the data cleaning result and the text error correction result in an Elasticsearch storage system, with URL HASH as the primary key.

Further, in the process of obtaining the confusion set in S1, 15% of the characters are randomly extracted from the corpus each time; of these, 60% undergo pronunciation confusion, 20% undergo glyph confusion and 20% undergo character confusion.
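The sampling proportions can be sketched as a routine that decides which positions of a sentence to corrupt and with which rule. This is an illustration only: applying the 15% rate per sentence rather than per corpus draw, the function name, and the reading of the three shares as 60% pronunciation, 20% glyph (font) and 20% character confusion (so that they sum to 100%) are assumptions.

```python
import random

# Assumed shares: pronunciation 60%, glyph 20%, character 20%.
RULE_WEIGHTS = (("pronunciation", 0.6), ("glyph", 0.2), ("character", 0.2))


def plan_confusions(sentence, rng, sample_rate=0.15, weights=RULE_WEIGHTS):
    """Return {position: rule_name} for the characters chosen for replacement."""
    if not sentence:
        return {}
    count = max(1, round(len(sentence) * sample_rate))
    positions = sorted(rng.sample(range(len(sentence)), count))
    names = [name for name, _ in weights]
    shares = [share for _, share in weights]
    return {pos: rng.choices(names, weights=shares, k=1)[0] for pos in positions}
```

Passing an explicit `random.Random(seed)` keeps the generated training pairs reproducible; the actual replacement of each selected character would then be delegated to the corresponding confusion rule.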

Further, in S2, the collecting process information includes URL, IP, protocol, proxy, request mode, request time, collecting status, and server; the acquisition result includes a request header, a response header, and corresponding content.

Further, in S3, the rules for classified storage are:

① Website tags: SiteName, SiteDomain, SiteIDcode, ColumName

② Column tags: ColumnDescription, ColumnKeywords, ColumnType

③ Content page tags: ArticleTitle, PubDate, ContentSource, Keywords, Author, Description, Image, Url.

Further, in S3, the text content of the web page is extracted by using a text extraction algorithm based on the text density and the symbol density of the web page after the noise reduction.

The beneficial effects of the invention are as follows: the invention discloses a text error correction method and system for publicly disclosed government information.

To address the problems described above, the invention adds character pronunciation, glyph shape and the character itself as features to a pre-trained model for training, which improves the correction accuracy for characters with similar pronunciation or similar shape and effectively lightens the workload of supervision and inspection personnel. The error correction accuracy of the model is about 70%, and training the error correction model with pronunciation and glyph shape as added features raises the accuracy to 83%.

In addition, corresponding weights are assigned to pronunciation, glyph shape and the character during error correction. In particular, because most current input is done with pinyin input methods, the weight of pronunciation exceeds one half, so that wrongly written characters can be judged more accurately and the accuracy is improved.

Drawings

FIG. 1 is a schematic diagram of an error correction process according to the present invention.

FIG. 2 is a schematic diagram of the fusion error correction of characters, pronunciation and glyph shape.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

Example 1: this embodiment provides a text error correction method applied to digital government, used mainly for correcting the text of webpages. To address the problems of existing error correction models, the method mainly considers pronunciation, the character itself and glyph shape during error correction.

When implemented, this embodiment comprises the following steps:

S1, model training

(1) First, a corpus is obtained

Integrating encyclopedias, headlines, discordant news and the like as resource libraries to obtain a corpus;

(2) Formulating confusion decision rules

The confusion rules comprise glyph confusion rules, pronunciation confusion rules and character confusion rules. The glyph confusion rule uses reverse five-stroke (Wubi) coding to decompose each glyph into its five-stroke roots, and the decomposition result is input into the glyph vector; one or more roots are randomly replaced according to the five-stroke error-prone library to form the corresponding glyph confusion set, where replacing one root gives a coding distance of 1 and replacing N roots gives a coding distance of N. Five-stroke coding splits a character into several independent parts, which effectively reduces dimensionality compared with stroke coding and markedly improves the effect and performance of the model.
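The coding distance used by the glyph (font) confusion rule can be sketched as a count of differing roots between two decompositions. This is an assumption-laden illustration: the root labels `r1` to `r4` and the characters `A` to `D` are placeholders rather than real five-stroke codes, and looking up existing characters by distance is one possible way to build a confusion set, not necessarily the one used by the method.

```python
def coding_distance(roots_a, roots_b):
    """Number of root positions that differ between two decompositions."""
    if len(roots_a) != len(roots_b):
        raise ValueError("decompositions must have the same number of roots")
    return sum(1 for x, y in zip(roots_a, roots_b) if x != y)


def glyph_confusion_set(char, table, max_distance=1):
    """Characters whose decomposition differs from char's by 1..max_distance roots."""
    target = table[char]
    candidates = []
    for other, roots in table.items():
        if other == char or len(roots) != len(target):
            continue
        if 1 <= coding_distance(target, roots) <= max_distance:
            candidates.append(other)
    return sorted(candidates)


# Hypothetical decomposition table with placeholder root labels.
DEMO_TABLE = {
    "A": ("r1", "r2"),
    "B": ("r1", "r3"),
    "C": ("r4", "r3"),
    "D": ("r1",),
}
```

With this table, `A` and `B` differ in one root (distance 1) and `A` and `C` differ in two (distance 2), mirroring the rule that replacing N roots gives a coding distance of N.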

The pronunciation confusion rule obtains the pinyin of a character from a pinyin dictionary, further obtains the initial, the final and the tone of that pinyin, and inputs the result into a pronunciation vector; the following rules are formulated according to the pinyin dictionary, taking the idiom sān xīn èr yì as an example:

① the pronunciation is the same and the tone is the same: the edit distance equals 0, recorded as 0000; for example, sān (unchanged) xīn;

② the pronunciation is the same and the tone is different: the edit distance equals 1, recorded as 0100; for example, san with a different tone in place of sān;

③ flat-tongue and retroflex initials are confused, or front and back nasal finals are confused: the edit distance equals 1, recorded as 001; for example, shan in place of sān;

④ one of the initial or the final is changed: the edit distance equals 1, recorded as 0; for example, sen in place of sān;

⑤ both the initial and the final are changed: the edit distance is greater than 1, recorded as 1; for example, shāng in place of sān;

Edit distances of different sizes are selected to generate the pronunciation confusion set.

The edit distance of the character confusion rule is 1, and the rule is to replace a character with one selected at random from the character library.

(3) Obtaining a confusion set

Preferably, in implementation, 15% of the characters are randomly extracted from the corpus each time; of these, 60% undergo pronunciation confusion, 20% undergo glyph confusion and 20% undergo character confusion.

(4) Model training

The confusion set is taken as the input set and the sample set as the comparison set, forming one-to-one sentence pairs, each consisting of a correct sentence and a wrong sentence containing the wrongly written characters introduced by confusion; the model is trained end to end to finally obtain the error correction model.

S2, data acquisition

The URL of the website to be checked is imported into the website data acquisition system, and the crawl is restricted to the website's domain name. The links in all webpages are acquired recursively (repeated links are removed according to URL HASH). The acquisition process information (URL, IP, protocol, proxy, request mode, request time, acquisition state, server and the like) and the acquisition result (request header, response header and corresponding content) are assembled into JSON as the data acquisition result, which is pushed to the KAFKA message system.
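The domain restriction, hash-based deduplication and JSON assembly described above might look like the following sketch. The use of SHA-1 for the URL hash, the JSON field names and the function names are assumptions; fetching pages and pushing to the message system are omitted because they need a network and a running broker.

```python
import hashlib
import json
from urllib.parse import urlsplit


def url_hash(url: str) -> str:
    # SHA-1 is an assumption; the text only says "URL HASH".
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def filter_new_links(links, seen_hashes, domain):
    """Keep links inside the domain that have not been seen before."""
    fresh = []
    for link in links:
        host = urlsplit(link).hostname or ""
        in_domain = host == domain or host.endswith("." + domain)
        key = url_hash(link)
        if in_domain and key not in seen_hashes:
            seen_hashes.add(key)
            fresh.append(link)
    return fresh


def build_acquisition_record(process_info: dict, result: dict) -> str:
    """Combine process information and result into one JSON document."""
    record = {
        "url_hash": url_hash(process_info["url"]),
        "process": process_info,  # URL, IP, protocol, request time, ...
        "result": result,         # request header, response header, content
    }
    return json.dumps(record, ensure_ascii=False)
```

Carrying the hash inside the record lets later stages reuse it as the storage key without recomputing it.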

S3, data cleaning

Subscribing to KAFKA messages, consuming the data acquisition result, and performing the following steps:

1. Subscribing to KAFKA messages, consuming the data acquisition result, identifying the content type, and filtering out non-HTML types.

2. Webpage preprocessing: the character encoding of the page source code is handled according to the charset attribute in the acquisition information to prevent garbled text, and the acquisition result is then serialized and the webpage parsed into a DOM tree.

3. Extracting the webpage tags: the meta tags are extracted from the DOM tree and stored by category:

① Website tags: SiteName, SiteDomain, SiteIDcode, ColumName

② Column tags: ColumnDescription, ColumnKeywords, ColumnType

③ Content page tags: ArticleTitle, PubDate, ContentSource, Keywords, Author, Description, Image, Url.

4. Extracting the webpage text: judging whether the meta tag extraction result contains content page tags, thereby determining whether the webpage is a content page containing body text; the body is taken out of the HTML source code of the content page, all tags in the body are removed, including styles, JavaScript scripts, comments and the like, while the original line breaks are preserved, and after this noise reduction the text content of the webpage is extracted using a text extraction algorithm based on text density and symbol density.

5. Outputting the processing result: JSON is formed from the tag extraction result and the text extraction result as the data cleaning result, added to the data acquisition result, and pushed to the KAFKA message system.
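The meta tag classification of step 3, and the content page test at the start of step 4, can be sketched with the standard `html.parser`. The group keys, the case-insensitive name matching and the helper names are assumptions; the tag names are those listed in the text.

```python
from html.parser import HTMLParser

TAG_GROUPS = {
    "site": ["SiteName", "SiteDomain", "SiteIDcode", "ColumName"],
    "column": ["ColumnDescription", "ColumnKeywords", "ColumnType"],
    "content": ["ArticleTitle", "PubDate", "ContentSource", "Keywords",
                "Author", "Description", "Image", "Url"],
}
# Case-insensitive lookup from meta name to (group, canonical name).
LOOKUP = {name.lower(): (group, name)
          for group, names in TAG_GROUPS.items() for name in names}


class MetaCollector(HTMLParser):
    """Collects recognised <meta name=... content=...> tags by group."""

    def __init__(self):
        super().__init__()
        self.tags = {group: {} for group in TAG_GROUPS}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        hit = LOOKUP.get((attributes.get("name") or "").lower())
        if hit:
            group, name = hit
            self.tags[group][name] = attributes.get("content") or ""


def classify_meta(html: str) -> dict:
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def is_content_page(tags: dict) -> bool:
    """Treat a page as a content page if any content page tag is present."""
    return bool(tags["content"])
```

Whether a single content page tag is enough to treat a page as a content page is a design choice left open by the text; a stricter check could require ArticleTitle specifically.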

S4, text error correction

1. Cutting the webpage text into sentences according to punctuation marks and paragraphs and first inputting each sentence into the error correction model to judge whether it contains errors; if so, the corrected result is substituted into the sentence and input into the model again, so that error correction is performed recursively; when two successive error correction results are the same, the recursion stops and the model correction result is obtained. The model correction result includes the error probabilities for the character, the pronunciation and the glyph (i.e., the first, second and third probabilities) and the corresponding corrected characters (i.e., the first, second and third characteristics).

3. For each sentence in the sentence array, forming a JSON from the position of each wrongly written character and its first probability, second probability, third probability, fusion probability, first characteristic, second characteristic, third characteristic and fusion characteristic, and adding the array formed by these JSON objects to the corresponding sentence JSON.

The first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second and third probabilities, and the fusion characteristic is the corresponding fusion error characteristic.

4. Outputting the processing result: the text error correction result is added to the data cleaning result JSON and pushed to the KAFKA message system.
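The sentence splitting and the recursive correction of step 1 can be sketched as a loop that re-applies the model until its output stops changing. The `correct` callable stands in for the trained error correction model, the punctuation set is an assumption, and `max_rounds` is an added safety guard that the text does not mention.

```python
import re

# Assumed sentence-ending punctuation; the text does not list the marks.
SENTENCE_PATTERN = re.compile(r"[^。！？!?；;\n]+[。！？!?；;]?")


def split_sentences(text: str):
    """Split text into sentences on punctuation marks and line breaks."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]


def correct_until_stable(sentence: str, correct, max_rounds: int = 5) -> str:
    """Re-apply `correct` until two successive results are identical."""
    current = sentence
    for _ in range(max_rounds):
        corrected = correct(current)
        if corrected == current:
            break
        current = corrected
    return current
```

As a usage example with a toy corrector that fixes one error per pass, `correct_until_stable("teh teh", lambda s: s.replace("teh", "the", 1))` needs two passes to change the text and a third to confirm that the result is stable.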

S5, data storage

Claims

1. A text error correction method applied to digital government, characterized in that the method comprises the following steps:

S1, model training

(1) Obtaining a corpus

(2) Formulating confusion decision rules

② the pronunciation is the same and the tone is different: the edit distance equals 1;

④ one of the initial or the final is changed: the edit distance equals 1;

(3) obtaining a confusion set

(4) model training

S2, data acquisition

Receiving a website domain name, recursively acquiring the links in all webpages, removing repeated links according to URL HASH, assembling the acquisition process information and the acquisition results into JSON as the data acquisition result, and pushing it to the KAFKA message system;

S3, data cleaning

2. Webpage preprocessing: handling the character encoding of the page source code according to the charset attribute in the acquisition information, then serializing the acquisition result and parsing the webpage into a DOM tree;

3. Extracting the webpage tags: extracting the meta tags from the DOM tree and storing them by category;

S4, text error correction

3. For each sentence in the sentence array, forming a JSON from the position of each wrongly written character and its first probability, second probability, third probability, fusion probability, first characteristic, second characteristic, third characteristic and fusion characteristic, and adding the array formed by these JSON objects to the corresponding sentence JSON;

S5, data storage

2. The text error correction method applied to digital government according to claim 1, wherein: in the process of obtaining the confusion set in S1, 15% of the characters are randomly extracted from the corpus each time; of these, 60% undergo pronunciation confusion, 20% undergo glyph confusion and 20% undergo character confusion.

3. The text error correction method applied to the digital government according to claim 1, wherein: in S2, the acquisition process information includes URL, IP, protocol, proxy, request mode, request time, acquisition status, and server; the acquisition results include a request header, a response header, and corresponding content.

4. The text error correction method applied to digital government according to claim 1, wherein: in S3, the rules for classified storage are:

① Website tags: SiteName, SiteDomain, SiteIDcode, ColumName

② Column tags: ColumnDescription, ColumnKeywords, ColumnType

5. The text error correction method applied to digital government according to claim 1, wherein: in S3, the text content of the webpage is extracted after noise reduction using a text extraction algorithm based on text density and symbol density.