CN117313706A

CN117313706A - Method and device for generating error corpus

Info

Publication number: CN117313706A
Application number: CN202311274183.1A
Authority: CN
Inventors: 王耀峰; 张学来; 王雅静; 应志红; 康丽丽
Original assignee: Beijing Thunisoft Information Technology Co ltd
Current assignee: Beijing Thunisoft Information Technology Co ltd
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2023-12-29

Abstract

The application discloses a method and a device for generating an error corpus. The method comprises the following steps: acquiring a reference text; determining the error type to which the clause result of the reference text belongs based on the sentence pattern requirement satisfied by the clause result, wherein the error type comprises a special error type and a common error type; and replacing the clause result of the reference text based on the error type to which it belongs and a preset error type processing rule until the number of errors of each error type meets a preset number, thereby completing the generation of the error corpus, wherein the preset error type processing rule comprises the processing priority of the error types and the generation rule of the sub-error types. The method can improve both the accuracy and the comprehensiveness of the error corpus.

Description

Method and device for generating error corpus

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a method and a device for generating an error corpus.

Background

As industries place increasingly strict requirements on the error rate of text, the requirements for detecting erroneous words rise accordingly. With the development of the Internet and artificial intelligence, text error correction in the prior art is mainly accomplished by training artificial intelligence models. The authenticity of the error corpus is important for such training.

At present, existing error corpora mainly target homophonic, near-phonic and similar-shape errors. However, real scenarios also involve problems such as redundant input, associative (predictive) input, missing words, semantic repetition and semantic errors. In formal scenarios such as legal documents, there may also be the problem that the logic is not rigorous. Existing error corpora are small in scale, low in accuracy and poor in comprehensiveness, so they are difficult to use for training or verifying a model.

Based on the above, the present specification provides a new method for generating an error corpus.

Disclosure of Invention

The embodiment of the application provides a method for generating an error corpus, which is used for solving the following problem: existing error corpora are small in scale, low in accuracy and poor in comprehensiveness, so they are difficult to use for training or verifying a model.

Specifically, the method for generating the error corpus comprises the following steps:

acquiring a reference text;

determining an error type of the clause result of the reference text based on the sentence pattern requirement satisfied by the clause result of the reference text, wherein the error type comprises a special error type and a common error type;

replacing the clause result of the reference text based on the error type and a preset error type processing rule of the clause result of the reference text until the number of the error types meets the preset number, and completing the generation of an error corpus, wherein the preset error type processing rule comprises: processing priority of error type and generation rule of sub error type.

The embodiment of the application also provides a device for generating the error corpus.

Specifically, a device for generating an error corpus includes:

the acquisition module acquires a reference text;

the judging module is used for determining the error type of the clause result of the reference text based on the sentence pattern requirement satisfied by the clause result of the reference text, wherein the error type comprises a special error type and a common error type;

the corpus generation module is used for replacing the clause result of the reference text based on the error type of the clause result of the reference text and a preset error type processing rule, until the number of the error types meets the preset number, and the generation of the error corpus is completed, wherein the preset error type processing rule comprises: processing priority of error type and generation rule of sub error type.

The technical scheme provided by the embodiment of the application has at least the following beneficial effects. A reference text is acquired; the error type to which the clause result of the reference text belongs is determined based on the sentence pattern requirement satisfied by the clause result, wherein the error type comprises a special error type and a common error type; and the clause result of the reference text is replaced based on the error type to which it belongs and a preset error type processing rule until the number of errors of each error type meets a preset number, thereby completing the generation of the error corpus, wherein the preset error type processing rule comprises the processing priority of the error types and the generation rule of the sub-error types. In this way, both the accuracy and the comprehensiveness of the error corpus can be improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

fig. 1 is a schematic diagram of a system architecture of a method for generating an error corpus according to an embodiment of the present disclosure;

fig. 2 is a flow chart of a method for generating an error corpus according to an embodiment of the present disclosure;

fig. 3 is a frame diagram of a method for generating an error corpus according to an embodiment of the present disclosure;

fig. 4 is a schematic diagram of a device for generating an error corpus according to an embodiment of the present disclosure.

Detailed Description

For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Fig. 1 is a schematic diagram of a system architecture of a method for generating an error corpus according to an embodiment of the present disclosure. As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The terminal devices 101, 102, 103 interact with the server 105 via the network 104 to receive or send messages and the like. Various client applications can be installed on the terminal devices 101, 102, 103, for example, a dedicated program for generating an error corpus.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be a variety of special purpose or general purpose electronic devices including, but not limited to, smartphones, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices, and they may be implemented as multiple software or software modules (e.g., multiple software or software modules for providing distributed services) or as a single software or software module.

The server 105 may be a server providing various services, such as a back-end server providing services for client applications installed on the terminal devices 101, 102, 103. For example, the server may generate the error corpus so that the generated error corpus is displayed on the terminal devices 101, 102, 103.

The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When server 105 is software, it may be implemented as multiple software or software modules (e.g., multiple software or software modules for providing distributed services), or as a single software or software module.

Fig. 2 is a flow chart of a method for generating an error corpus according to an embodiment of the present disclosure. From the program perspective, the execution subject of the flow may be a program installed on an application server or an application terminal. It is understood that the method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities. As shown in fig. 2, the generating method includes:

Step S201: acquiring a reference text.

In the embodiment of the present specification, the reference text refers to a real, accurate and semantically complete Chinese text. In the process of generating the error corpus, the reference text is used as a template and is modified according to errors of different categories, so that the error corpus is generated. In one embodiment of the present specification, the reference text is text derived from the legal field. There may be one or more reference texts, and the specific number of reference texts does not limit the present application.

Step S203: determining the error type to which the clause result of the reference text belongs based on the sentence pattern requirement satisfied by the clause result of the reference text, wherein the error type comprises a special error type and a common error type.

Since the reference text generally includes a plurality of sentences, generating an error corpus requires first splitting the reference text into clauses to obtain a clause result. Specifically, the period in the reference text is used as the clause delimiter, and each span of text ending with a period is taken as one clause, so that the clause result is obtained.
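The clause segmentation described above can be sketched as follows. This is a minimal illustration, assuming that the delimiter is the Chinese full stop "。" and that trailing text without a final period is kept as its own clause; the function name `split_clauses` is hypothetical and does not appear in the original specification.

```python
def split_clauses(reference_text: str, delimiter: str = "。") -> list[str]:
    """Split a reference text into clauses, each ending with the delimiter."""
    clauses: list[str] = []
    current: list[str] = []
    for ch in reference_text:
        current.append(ch)
        if ch == delimiter:
            clause = "".join(current).strip()
            if clause:
                clauses.append(clause)
            current = []
    # Assumption: text after the last period is kept as a final clause.
    tail = "".join(current).strip()
    if tail:
        clauses.append(tail)
    return clauses


if __name__ == "__main__":
    for clause in split_clauses("原告提起诉讼。被告提交答辩状。"):
        print(clause)
```

Keeping the period attached to each clause lets later punctuation-related rules operate on the clause without re-reading the reference text.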

In this embodiment of the present disclosure, the determining, based on the sentence pattern requirement satisfied by the clause result of the reference text, the error type to which the clause result of the reference text belongs specifically includes:

determining the error type to which the clause result of the reference text belongs based on the result of matching the clause result of the reference text against a preset word stock, wherein the preset word stock comprises the word stocks corresponding to each sub-type included in the error types.

In this embodiment of the present disclosure, the preset word stock includes a first preset word stock corresponding to the special error type and a second preset word stock corresponding to the common error type. The first preset word stock comprises the sub-word stocks of all sub-error types corresponding to the special error type, and the second preset word stock comprises the sub-word stocks of all sub-error types corresponding to the common error type.

For the construction of the preset word stock, the preset word stock comprises a built-in word stock and a user-defined word stock. The built-in word stock is a universal word stock in each field, the user-defined word stock is a special word stock in a specific field, and in the embodiment of the specification, the user-defined word stock is a legal field special word stock.
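The word stock matching described above can be sketched as follows. This is a minimal illustration under stated assumptions: the two dictionaries, their entries, the sub-type labels and the function names are all hypothetical, and checking the first preset word stock before the second is an assumption motivated by the higher priority of special errors.

```python
from typing import Optional

# Hypothetical toy word stocks: sub-error type -> words that trigger it.
FIRST_WORD_STOCK = {  # special error type
    "common_sense": ["四川省", "河北省"],
    "heteromorphic_word": ["笔画"],
}
SECOND_WORD_STOCK = {  # common error type
    "homophonic": ["权利"],
    "punctuation": ["，", "、"],
}


def match_sub_types(clause: str, word_stock: dict[str, list[str]]) -> list[str]:
    """Return the sub-error types whose word list matches the clause."""
    return [
        sub_type
        for sub_type, words in word_stock.items()
        if any(word in clause for word in words)
    ]


def classify_clause(clause: str) -> tuple[Optional[str], list[str]]:
    """Return (error type, matching sub-error types) for one clause."""
    special = match_sub_types(clause, FIRST_WORD_STOCK)
    if special:
        return "special", special
    common = match_sub_types(clause, SECOND_WORD_STOCK)
    if common:
        return "common", common
    return None, []


if __name__ == "__main__":
    print(classify_clause("原告住所地为四川省成都市。"))
    print(classify_clause("当事人依法享有权利。"))
```

A built-in word stock and a user-defined (for example, legal) word stock could be merged into these dictionaries before matching.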

In the embodiment of the present specification, the processing priority of the error types is that the special error type has a higher priority than the common error type;

the sub-error types are subsets of the error types. The sub-error types of the common error type comprise a first sub-error type and a second sub-error type: the first sub-error type comprises homophonic errors, near-phonic errors, similar-shape errors and multi-word errors, and the second sub-error type comprises word missing errors, misplacement errors, traditional Chinese character errors, repeated character errors, letter word errors, punctuation missing errors and punctuation superfluous errors. The sub-error types of the special error type comprise a third sub-error type and a fourth sub-error type: the third sub-error type comprises heteromorphic word errors, improper emotion errors, place name change errors and common sense errors, and the fourth sub-error type comprises legal-fact inconsistency errors;

the generation rule of a sub-error type is the rule, corresponding to that sub-error type, that is used for generating the error corpus.

It should be noted that, if the type of error to which the clause result of the reference text belongs is a common sense error, the determination may be performed by determining whether the reference text includes province and city names.

In the embodiment of the present disclosure, errors of the first sub-error type are mostly caused by the input method, so the word frequency of a generated candidate error word is looked up, and a candidate whose word frequency is higher than a threshold replaces the original word. Errors of the second sub-error type have no regular pattern and may occur at any position; with model training in mind, a transition matrix is generally used to infer the relevance between words, and errors are generated at positions with strong relevance. Errors of the third sub-error type are mainly implemented through manually collected word lists. In a specific embodiment, heteromorphic words can be taken from the national standard "First Batch of Heteromorphic Words Arrangement List", letter words can be taken from English abbreviations commonly used in Chinese, common sense errors can be mismatches between a province name and a place name, for example "Shijia City, Sichuan Province", and place name change errors are formed from place name changes produced by historical evolution. Errors of the fourth sub-error type are mainly writing errors in legal documents, as well as replacements of certain legal proper nouns.

Step S205: replacing the clause result of the reference text based on the error type and a preset error type processing rule of the clause result of the reference text until the number of the error types meets the preset number, and completing the generation of an error corpus, wherein the preset error type processing rule comprises: processing priority of error type and generation rule of sub error type.

In this embodiment of the present disclosure, the replacing the clause result of the reference text based on the error type and the preset error type processing rule to which the clause result of the reference text belongs until the number of error types meets the preset number, and completing the generation of the error corpus specifically includes:

replacing the clauses whose error type is the special error type in the clause result of the reference text based on the sub-word lists corresponding to the special error, until each sub-error of the special error meets its corresponding preset quantity requirement, thereby completing the processing of the special error type and obtaining a first error corpus;

after the special error type processing is completed, word segmentation is carried out on the sentence segmentation result with the error type of the common error type in the sentence segmentation result of the reference text to obtain a word segmentation result, the word segmentation result is replaced until each sub-error of the common error meets the corresponding preset quantity requirement, and then the processing of the common error type is completed to obtain a second error corpus;

the first error corpus and the second error corpus form the error corpus, and the generation of the error corpus is completed.

In this embodiment of the present disclosure, "each sub-error of the special error meets its corresponding preset quantity requirement" means that every sub-error of the special error has its own preset quantity requirement, which is set in advance according to service requirements.

In the embodiment of the present disclosure, the processing priority of the special error type is higher than that of the common error type. That is, when the clause result of the reference text is processed, the special error type is processed first so that its preset quantity requirement is guaranteed to be met; only after that requirement is met is common error processing performed on the clause result of the reference text. It should be noted that each clause result can be processed with only one error type: a clause result is processed either with the special error type or with the common error type, and never with both.
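The priority and quota handling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function `generate_error_corpus`, the `Rule` type and the example rules are hypothetical, each rule returns `None` when it does not apply to a clause, and the sketch moves on to common errors after one pass even if a special quota cannot be met because no suitable clause remains.

```python
from collections import Counter
from typing import Callable, Optional

# A rule maps a clause to an erroneous clause, or None if it does not apply.
Rule = Callable[[str], Optional[str]]


def generate_error_corpus(
    clauses: list[str],
    special_rules: dict[str, Rule],
    common_rules: dict[str, Rule],
    quotas: dict[str, int],
) -> tuple[list[tuple[str, str, str]], Counter]:
    """Apply special rules first, then common rules, one error type per clause."""
    counts: Counter = Counter()
    used: set[int] = set()  # indices of clauses that already carry an error
    corpus: list[tuple[str, str, str]] = []
    for rules in (special_rules, common_rules):  # special errors have priority
        for sub_type, rule in rules.items():
            for i, clause in enumerate(clauses):
                if counts[sub_type] >= quotas.get(sub_type, 0):
                    break  # quota for this sub-error type is met
                if i in used:
                    continue  # a clause is processed with only one error type
                result = rule(clause)
                if result is not None and result != clause:
                    corpus.append((sub_type, clause, result))
                    used.add(i)
                    counts[sub_type] += 1
    return corpus, counts


if __name__ == "__main__":
    demo_clauses = ["甲在四川省。", "乙享有权利。", "丙在四川省。"]
    special = {
        "common_sense": lambda c: c.replace("四川省", "河北省") if "四川省" in c else None
    }
    common = {"word_missing": lambda c: c[1:] if len(c) > 1 else None}
    result, totals = generate_error_corpus(
        demo_clauses, special, common, {"common_sense": 1, "word_missing": 5}
    )
    print(result, dict(totals))
```

The returned counter also provides the per-type error counts that the method needs in order to check the preset numbers.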

In this embodiment of the present disclosure, the replacing, based on a subword table corresponding to the special error, a clause with an error type of the special error type in the clause result of the reference text, specifically includes:

if the special error type is a heteromorphic word error, an improper emotion error or a place name change error, classifying the clause result of the reference text as a clause of the special error type, and replacing any word in that clause according to the heteromorphic word, improper emotion and place name change sub-word lists in the first preset word stock;

if the special error type is a common sense error, classifying the clause result of the reference text as a clause of the special error type, and randomly replacing the province or city name in the clause according to the province and city names in the common sense error sub-word list in the first preset word stock, wherein the common sense error is a place name error;

if the special error type is a legal-fact inconsistency error, replacing the clause whose error type is the special error type in the clause result of the reference text based on a logic error rule.
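The word-list replacement and the place name replacement described above can be sketched as follows. This is a minimal illustration, assuming toy tables: `HETEROMORPHIC`, `PROVINCE_CITIES` and both function names are hypothetical, and the real sub-word lists of the first preset word stock are not given in the specification.

```python
import random
from typing import Optional

# Hypothetical sub-word list: form in the reference text -> injected variant.
HETEROMORPHIC = {"笔画": "笔划"}
# Hypothetical province -> cities table for common sense (place name) errors.
PROVINCE_CITIES = {
    "四川省": ["成都市", "绵阳市"],
    "河北省": ["石家庄市", "保定市"],
}


def replace_from_word_list(
    clause: str, word_list: dict[str, str], rng: random.Random
) -> Optional[str]:
    """Replace one listed word found in the clause with its listed variant."""
    hits = [word for word in word_list if word in clause]
    if not hits:
        return None
    word = rng.choice(hits)
    return clause.replace(word, word_list[word], 1)


def make_place_name_error(
    clause: str, table: dict[str, list[str]], rng: random.Random
) -> Optional[str]:
    """Swap the city for one from another province so the pairing is wrong."""
    for province, cities in table.items():
        if province not in clause:
            continue
        for city in cities:
            if city in clause:
                others = [c for p, cs in table.items() if p != province for c in cs]
                if not others:
                    return None
                return clause.replace(city, rng.choice(others), 1)
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    print(replace_from_word_list("该字笔画较多。", HETEROMORPHIC, rng))
    print(make_place_name_error("住所地为四川省成都市。", PROVINCE_CITIES, rng))
```

Passing an explicit `random.Random` instance keeps the generated corpus reproducible for a given seed.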

In this embodiment of the present disclosure, the step of performing word segmentation on a sentence result with an error type of a common error type in the sentence result of the reference text to obtain a word segmentation result, and the step of replacing the word segmentation result specifically includes:

if the common error type is a homophonic or near-phonic error, after matching the word segmentation result against the user-defined word list in the second preset word stock, converting the word segmentation result into pinyin or a near-phonic word and matching it against the pinyin word list in the second preset word stock, so as to replace the word segmentation result;

if the common error type is a similar-shape error, matching the word segmentation result against the user-defined word list in the second preset word stock and then against the similar-shape Chinese character list in the second preset word stock, so as to replace the word segmentation result;

if the common error type is a multi-word error, replacing the word segmentation result based on the error cause of the multi-word error;

if the common error type is a word missing error, randomly deleting a Chinese word from the word segmentation result, so as to replace the word segmentation result;

if the common error type is a misplacement error, swapping the front and back halves of a four-character word in the word segmentation result using a two-character window, and, if no four-character word exists in the word segmentation result, swapping two adjacent two-character words, so as to replace the word segmentation result;

if the common error type is a traditional Chinese character error, a repeated character error or a letter word error, randomly selecting one character in the word segmentation result and replacing it according to the second preset word stock, so as to replace the word segmentation result;

if the common error type is a punctuation missing error or a punctuation superfluous error, randomly deleting or duplicating a punctuation mark at a position in the word segmentation result according to the second preset word stock, so as to replace the word segmentation result.
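Three of the common error rules above (word missing, misplacement, and punctuation missing or superfluous) can be sketched as follows. This is a minimal illustration operating on an already segmented word list; the function names and the `PUNCTUATION` set are hypothetical, and deleting a whole word rather than a single character is an assumption.

```python
import random
from typing import Optional

PUNCTUATION = set("，。、；：")  # hypothetical punctuation inventory


def delete_random_word(words: list[str], rng: random.Random) -> Optional[list[str]]:
    """Word missing error: drop one randomly chosen word."""
    if len(words) < 2:
        return None
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


def misplace(words: list[str]) -> Optional[list[str]]:
    """Misplacement error: swap the halves of a four-character word,
    else swap two adjacent two-character words."""
    for i, word in enumerate(words):
        if len(word) == 4:
            return words[:i] + [word[2:] + word[:2]] + words[i + 1:]
    for i in range(len(words) - 1):
        if len(words[i]) == 2 and len(words[i + 1]) == 2:
            return words[:i] + [words[i + 1], words[i]] + words[i + 2:]
    return None


def alter_punctuation(
    words: list[str], rng: random.Random, duplicate: bool
) -> Optional[list[str]]:
    """Punctuation superfluous (duplicate=True) or missing (duplicate=False)."""
    positions = [i for i, word in enumerate(words) if word in PUNCTUATION]
    if not positions:
        return None
    i = rng.choice(positions)
    if duplicate:
        return words[: i + 1] + [words[i]] + words[i + 1:]
    return words[:i] + words[i + 1:]


if __name__ == "__main__":
    rng = random.Random(1)
    print("".join(misplace(["被告", "安居乐业", "。"])))
    print("".join(alter_punctuation(["原告", "，", "被告", "。"], rng, duplicate=True)))
```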

In this embodiment of the present disclosure, converting the word segmentation result into pinyin or a near-phonic word and matching it against the pinyin word list in the second preset word stock to replace the word segmentation result specifically includes:

converting the word segmentation result into pinyin, matching the pinyin against the pinyin word list in the second preset word stock, and replacing the word segmentation result with a word whose word frequency is lower than a preset value as the replacement word;

if no such replacement word is available, converting the pinyin into a near-phonic word, matching the near-phonic word against the pinyin word list in the second preset word stock, and replacing the word segmentation result.

It should be noted that, in one embodiment, the preset value for the word frequency is 10%.
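The homophonic replacement with near-phonic fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions: the three toy tables and both function names are hypothetical, the 10% preset value is interpreted as a relative word frequency of 0.10, the same frequency filter is applied in the near-phonic fallback, and the most frequent remaining candidate is chosen.

```python
from typing import Optional

# Hypothetical toy tables; a real system would use a full pinyin lexicon.
WORD_TO_PINYIN = {"权利": "quanli", "诉讼": "susong"}
# Pinyin word list: pinyin -> {word: relative word frequency}.
PINYIN_WORD_LIST = {
    "quanli": {"权利": 0.50, "权力": 0.45, "全力": 0.05},
    "shusong": {"输送": 0.08},
}
# Near-phonic mapping between confusable pinyin spellings.
NEAR_PHONIC = {"susong": ["shusong"]}


def pick_replacement(word: str, pinyin: str, max_freq: float) -> Optional[str]:
    """Pick a different word with this pinyin whose frequency is below max_freq."""
    candidates = {
        w: f
        for w, f in PINYIN_WORD_LIST.get(pinyin, {}).items()
        if w != word and f < max_freq
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def homophone_replace(word: str, max_freq: float = 0.10) -> Optional[str]:
    """Try an exact-pinyin replacement first, then near-phonic pinyin."""
    pinyin = WORD_TO_PINYIN.get(word)
    if pinyin is None:
        return None
    replacement = pick_replacement(word, pinyin, max_freq)
    if replacement is not None:
        return replacement
    for near in NEAR_PHONIC.get(pinyin, []):
        replacement = pick_replacement(word, near, max_freq)
        if replacement is not None:
            return replacement
    return None


if __name__ == "__main__":
    print(homophone_replace("权利"))
    print(homophone_replace("诉讼"))
```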

In the embodiment of the present specification, the error causes of the multi-word error include semantic repetition and improper collocation;

replacing the word segmentation result based on the error cause of the multi-word error specifically comprises the following steps:

if the error cause of the multi-word error is improper collocation, randomly selecting a word that is a verb or an adjective from the word segmentation result, performing association with a ternary (trigram) transition matrix to obtain a prediction result, and appending the prediction result to the verb or adjective, so as to replace the word segmentation result;

if the error cause of the multi-word error is semantic repetition, splicing in a synonym from the synonym table of the second preset word stock, so as to replace the word segmentation result.
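The two multi-word rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the ternary transition matrix is modelled as trigram counts (the word most often following a pair of consecutive words), the part-of-speech based random selection of a verb or adjective is left to the caller through the `index` argument, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Optional


def build_trigram_table(sentences: list[list[str]]) -> dict:
    """Count which word follows each pair of consecutive words."""
    table: dict = defaultdict(Counter)
    for words in sentences:
        for a, b, c in zip(words, words[1:], words[2:]):
            table[(a, b)][c] += 1
    return table


def predict_next(table: dict, prev: str, current: str) -> Optional[str]:
    """Return the most frequent follower of (prev, current), if any."""
    followers = table.get((prev, current))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def inject_collocation_error(
    words: list[str], index: int, table: dict
) -> Optional[list[str]]:
    """Improper collocation: append the predicted word after words[index]."""
    if index < 1:
        return None
    predicted = predict_next(table, words[index - 1], words[index])
    if predicted is None:
        return None
    return words[: index + 1] + [predicted] + words[index + 1:]


def inject_semantic_repetition(
    words: list[str], index: int, synonyms: dict[str, str]
) -> Optional[list[str]]:
    """Semantic repetition: splice a synonym right after words[index]."""
    synonym = synonyms.get(words[index])
    if synonym is None:
        return None
    return words[: index + 1] + [synonym] + words[index + 1:]


if __name__ == "__main__":
    training = [["原告", "提起", "诉讼"], ["原告", "提起", "诉讼"], ["原告", "提起", "上诉"]]
    table = build_trigram_table(training)
    print(inject_collocation_error(["原告", "提起", "上诉", "。"], 1, table))
    print(inject_semantic_repetition(["立即", "执行"], 0, {"立即": "马上"}))
```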

In this embodiment of the present disclosure, if the special error type is a legal-fact inconsistency error, replacing the clause whose error type is the special error type in the clause result of the reference text based on a logic error rule specifically includes:

if the special error type is a legal-fact inconsistency error, replacing the clause whose error type is the special error type in the clause result of the reference text by any one of: changing a term into an abbreviation, modifying a proper noun, separating with pause marks, using a connecting mark, using a comma, using brackets, replacing a verb, and randomly replacing a measure word.

Specifically, if the special error type is a legal-fact inconsistency error, replacing legal nouns in the clause whose error type is the special error type in the clause result of the reference text with abbreviations;

or modifying legal proper nouns in the clause whose error type is the special error type in the clause result of the reference text;

or, when a plurality of book title marks or quotation marks are used in parallel in the clause whose error type is the special error type in the clause result of the reference text, separating them with pause marks;

or, when a numerical value or a start-stop year range exists in the clause whose error type is the special error type in the clause result of the reference text, using a connecting mark;

or, when parallel sentences exist in the clause whose error type is the special error type in the clause result of the reference text, using a comma;

or, in the clause whose error type is the special error type in the clause result of the reference text, nesting brackets of the same form, using improper punctuation after Arabic numeral sequence numbers, using brackets to mark the year of issuance, adding punctuation after an attachment name, changing a date in the text into a non-existent date, replacing a verb with a semantically similar verb found through word vectors, or randomly replacing a measure word.
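Two of the logic error rules above (separating parallel book title marks or quotation marks with pause marks, and changing a date into a non-existent date) can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and choosing day 32 as the non-existent day is an assumption.

```python
import re
from typing import Optional


def insert_pause_marks(clause: str) -> Optional[str]:
    """Insert a pause mark between adjacent book title marks or quotation marks."""
    out = re.sub(r"》《", "》、《", clause)
    out = re.sub(r"”“", "”、“", out)
    return out if out != clause else None


def make_nonexistent_date(clause: str) -> Optional[str]:
    """Change the day of the first 'Y年M月D日' date to a day no month has."""
    match = re.search(r"(\d{4})年(\d{1,2})月(\d{1,2})日", clause)
    if match is None:
        return None
    bad_date = f"{match.group(1)}年{match.group(2)}月32日"
    return clause[: match.start()] + bad_date + clause[match.end():]


if __name__ == "__main__":
    print(insert_pause_marks("依据《民法典》《民事诉讼法》的规定。"))
    print(make_nonexistent_date("本判决于2023年9月28日作出。"))
```

Each function returns `None` when its rule does not apply, so a caller can try the rules in turn until one produces a modified clause.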

In the embodiment of the present disclosure, if the special error type is a common sense error, the province or city name in the clause of the reference text is replaced so that the province and the city no longer match.

It should be noted that, whether the error type is the common error type or the special error type, the number of errors of each error type must meet the preset number in the process of generating the error corpus; therefore, the number of errors corresponding to each error type needs to be counted during generation.

To further explain the error corpus generation method provided by the present description, fig. 3 is a frame diagram of the error corpus generation method provided by the embodiment of the present description. As shown in fig. 3, using the first preset word stock and the second preset word stock, a reference text is input and traversed to split it into clauses; a special error processor performs special error processing on the clause results until each sub-error of the special error meets its corresponding preset quantity requirement; a common error processor then performs common error processing on the clauses of the reference text that have not undergone special error processing, so that the error corpus is obtained.

By adopting the method for generating the error corpus provided by the embodiment of the specification, the accuracy of the error corpus can be improved, and the comprehensiveness of the error corpus can be improved.

The foregoing details a method for generating an error corpus, and accordingly, the present disclosure also provides a device for generating an error corpus, as shown in fig. 4. Fig. 4 is a schematic diagram of a device for generating an error corpus according to an embodiment of the present disclosure, where the generating device includes:

an acquisition module 401 for acquiring a reference text;

the judging module 403 determines an error type to which the clause result of the reference text belongs based on the sentence pattern requirement satisfied by the clause result of the reference text, where the error type includes a special error type and a common error type;

the corpus generating module 405 replaces the clause result of the reference text based on the error type to which the clause result of the reference text belongs and a preset error type processing rule, until the number of the error types meets a preset number, and completes the generation of the error corpus, where the preset error type processing rule includes: processing priority of error type and generation rule of sub error type.

It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims

1. The generation method of the error corpus is characterized by comprising the following steps:

acquiring a reference text;

2. The generating method according to claim 1, wherein the determining the error type to which the clause result of the reference text belongs based on the sentence pattern requirement satisfied by the clause result of the reference text specifically includes:

3. The generation method according to claim 1, wherein the processing priority of the error type is that the priority of the special error is higher than that of the common error;

4. The generating method according to claim 1, wherein the replacing the sentence result of the reference text based on the error type to which the sentence result of the reference text belongs and a preset error type processing rule until the number of the error types meets a preset number, and completing the generation of the error corpus, specifically includes:

5. The method of claim 4, wherein the replacing the clause of the reference text with the clause of the specific error type based on the subword table corresponding to the specific error in the clause result of the reference text comprises:

6. The method of claim 4, wherein the step of performing word segmentation on the sentence result with the error type being the common error type in the sentence result of the reference text to obtain a word segmentation result, and the step of replacing the word segmentation result specifically comprises:

7. The method of generating as claimed in claim 6, wherein said converting the word segmentation result into pinyin or a near-phonic word, matching it against the pinyin word list in the second preset word stock, and replacing the word segmentation result specifically includes:

8. The method of generating of claim 6, wherein the error causes of the multi-word error include semantic repetition and improper collocation;

9. The method as claimed in claim 4, wherein if the special error type is a legal-fact inconsistency error, replacing the clause whose error type is the special error type in the clause result of the reference text based on the logic error rule specifically comprises:

10. A generation device of an error corpus, characterized in that the generation device comprises:

the acquisition module acquires a reference text;