CN112612892A - Special field corpus model construction method, computer equipment and storage medium - Google Patents

Special field corpus model construction method, computer equipment and storage medium

Info

Publication number
CN112612892A
CN112612892A (application CN202011589591.2A)
Authority
CN
China
Prior art keywords
corpus
model
frequency
words
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011589591.2A
Other languages
Chinese (zh)
Other versions
CN112612892B (en)
Inventor
顾嘉晟
岳小龙
高翔
纪达麒
陈运文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202011589591.2A
Publication of CN112612892A
Application granted
Publication of CN112612892B
Legal status: Active
Anticipated expiration

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; classification
    • G06F16/353: Classification into predefined classes
    • G06F16/355: Class or cluster creation or modification
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing a domain-specific corpus model, computer equipment and a storage medium, wherein the method comprises the following steps: step one, corpus collection and preprocessing: obtaining a sufficient pure unsupervised corpus through data cleaning; step two, term frequency and inverse document frequency analysis: identifying words of higher importance in the pure unsupervised corpus with the TF-IDF statistic; step three, data enhancement: enhancing the sentences in which the high-frequency words extracted in step two appear; step four, language model training: modeling the pure unsupervised corpus enhanced in step three with an XLNet model to generate the domain-specific corpus model. With the domain-specific corpus model generated from the data-enhanced special corpus, the invention can significantly improve the accuracy, recall and F1 score of classification tasks. The method can also greatly shorten the language model pre-training process and greatly reduce the resources it consumes.

Description

Special field corpus model construction method, computer equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a domain-specific corpus model construction method, computer equipment and a storage medium.
Background
Enterprises' daily operations involve a large amount of text processing, the document types are highly varied, and each type has relatively fixed formats, specifications, fixed collocations and the like. Daily document processing therefore presents many application scenarios for natural language processing on text, such as word segmentation, document format and type classification, text sentiment analysis, key information extraction, contract document review and document similarity calculation.
Currently, in both academia and industry, most NLP tasks rely on pre-trained language models such as n-gram models, BERT, GPT and their variants. The idea of pre-training is that the parameters of a deep neural network are not initialized randomly; they are first learned through a language modeling task, and downstream NLP tasks are then completed via transfer learning.
In practice, however, domain-specific document processing runs into a problem: the general corpora used for pre-training do not contain enough of the relevant language patterns, such as industry-specific terminology and the fixed grammatical collocations of particular document types, so downstream tasks such as key information extraction cannot accurately hit the complete key information. In text classification or key information extraction tasks in special fields such as finance, language models generated from general corpora tend to show some bias in semantic understanding.
As another example, in a text word segmentation task, preset keywords can only be handled crudely by adding a keyword dictionary.
Disclosure of Invention
To remedy the shortcomings of general language models on document NLP tasks in specific industry fields, the invention provides a domain-specific corpus model construction method, computer equipment and a storage medium. The domain-specific corpus model generated by the invention can serve multiple document types and multiple NLP tasks, and can shorten the fine-tuning time of downstream models, thereby reducing server resource consumption.
The technical scheme of the invention is as follows:
a method for constructing a domain-specific corpus model comprises the following steps:
step one, corpus collection and preprocessing: obtaining a sufficient pure unsupervised corpus through data cleaning;
step two, term frequency and inverse document frequency analysis: identifying words of higher importance in the pure unsupervised corpus with the TF-IDF statistic, removing common words using the inverse document frequency component of TF-IDF, and taking the higher-frequency words among those remaining as high-frequency words of the current text or of the domain-specific corpus;
step three, data enhancement: enhancing the sentences in which the high-frequency words extracted in step two appear, the enhancement method being: copying the paragraph containing the high-frequency word and inserting the copy at a random position in the pure unsupervised corpus;
step four, language model training: modeling the pure unsupervised corpus enhanced in step three with an XLNet model to generate the domain-specific corpus model; once the domain-specific corpus model is trained, the corpus is re-segmented with the word segmentation model it generates, and training is iterated again to improve the language model.
Further, in step one, the data cleaning includes parsing and extracting the text in a large number of PDF files, the parsing including:
keeping the text content continuous and dividing it by paragraph, so that the context within a paragraph is coherent;
treating the document title as a separate paragraph, and each section title in the body as a separate paragraph, to keep consecutive sentences continuous.
Further, for content laid out in two or more columns, if the directly read content is semantically coherent it is kept as pure corpus; otherwise it is discarded.
Further, the parsing also includes:
normalizing the text between traditional and simplified Chinese characters, converting all traditional characters to simplified ones;
removing author information, table-of-contents content, pictures, charts, tables, headers and footers.
Further, in step two, the frequency of each word within the current text (its term frequency) is calculated, and each word's occurrence across all texts is used to derive its inverse document frequency; common words are finally filtered out by the product of the two, retaining the words that are important to each file.
Further, in step three an enhancement amplitude is set; if the pure unsupervised corpus obtained in step one is small, the amplitude is set so that paragraphs containing high-frequency words are copied 3-5 times.
Further, in step four, context words are used to predict the next word, and the domain-specific corpus model is pre-trained in both the forward and backward directions.
Further, for a given text sequence, each token is predicted from its preceding or following context, and the probabilities of all time steps are multiplied together to form the model's objective function.
A computer device comprises a memory and a processor, the memory storing a computer program; when the processor executes the computer program, it implements the steps of the above domain-specific corpus model construction method.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above domain-specific corpus model construction method.
The invention has the beneficial effects that:
1. In text classification or key information extraction tasks in special fields, language models generated from general corpora show some bias in semantic understanding; the domain-specific corpus model generated from the data-enhanced special corpus significantly improves the accuracy, recall and F1 score of classification tasks.
2. Language model pre-training usually consumes a large amount of graphics card (GPU) resources; because the domain-specific corpus is orders of magnitude smaller than a general corpus, the pre-training process is greatly shortened and its resource consumption greatly reduced.
3. Because the domain-specific corpus model generated by the invention understands domain-specific text better, downstream NLP tasks can reach higher accuracy on smaller training sets, reducing training cost.
Drawings
Fig. 1 is a flowchart of the domain-specific corpus model construction method according to embodiment 2 of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, specific embodiments of the present invention will now be described. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
This embodiment provides a domain-specific corpus model construction method comprising the following steps:
Step one, corpus collection and preprocessing
In many industries, such as finance, information disclosure requirements mean that large numbers of publicly released PDF files can be found on the web, of types including, but not limited to, bond prospectuses, IPO prospectuses, investment fund contracts and equity pledge agreements.
In this step, the text in this large body of PDF files must be parsed and extracted to obtain a sufficient pure unsupervised corpus. The specific parsing steps are as follows (a code sketch follows the list):
(1) keep the text content continuous and divide it by paragraph, so that the context within a paragraph is coherent;
(2) normalize the text between traditional and simplified Chinese characters, converting all traditional characters to simplified ones;
(3) treat the document title as an independent paragraph, and each section title in the body as an independent paragraph, to keep consecutive sentences continuous;
(4) remove author information, table-of-contents content, pictures, charts, tables, headers and footers;
(5) for double-column or multi-column content, if the directly read content is semantically coherent, keep it as pure corpus; otherwise discard it.
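The following is a minimal sketch of this preprocessing step. The patent names no libraries, so pdfminer.six (PDF text extraction) and OpenCC (traditional-to-simplified conversion) are illustrative choices, and the filtering shown stands in for only a fragment of rules (1)-(5):

```python
# Illustrative preprocessing sketch; library choices are assumptions,
# not part of the patent.
from pdfminer.high_level import extract_text  # pip install pdfminer.six
from opencc import OpenCC                     # pip install opencc-python-reimplemented

t2s = OpenCC("t2s")  # rule (2): convert traditional characters to simplified

def clean_paragraphs(pdf_path: str) -> list[str]:
    raw = extract_text(pdf_path)
    paragraphs = []
    for para in raw.split("\n\n"):    # rules (1) and (3): paragraph-level units
        para = t2s.convert(para).strip()
        # Stand-in for rule (4): real filters for author lines, TOC entries,
        # headers/footers and tables would be document-specific.
        if not para or para.isdigit():  # e.g. bare page numbers
            continue
        paragraphs.append(para)
    return paragraphs
```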
Step two, analyzing word frequency and inverse text frequency index
The domain-specific corpus is an order of magnitude smaller than a general corpus, so in steps two and three data enhancement is applied to the special sentence patterns, domain-specific terms and the like in the corpus, making the model more familiar with their usage.
Step one yields a sufficient pure unsupervised corpus through data cleaning. In this step, words of higher importance in that corpus are identified with the TF-IDF statistic: common words are removed using the inverse document frequency component of TF-IDF, and the higher-frequency words among those remaining are taken as the high-frequency words of the current text; these serve as the high-frequency words of the domain-specific corpus and receive special treatment.
Specifically, the frequency of each word within the current text (its term frequency) is calculated, and each word's occurrence across all texts is used to derive its inverse document frequency; common words are filtered out by the product of the two, retaining the words that are important to each file, as in the sketch below.
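A minimal sketch of this TF-IDF step follows. The segmenter is an assumption (the patent names none; jieba is used purely for illustration), and k, the number of words kept per document, is a free parameter:

```python
import math
from collections import Counter

import jieba  # pip install jieba; illustrative Chinese word segmenter

def high_freq_words(docs: list[str], k: int = 20) -> list[list[str]]:
    """Return the top-k words of each document by TF * IDF."""
    tokenized = [list(jieba.cut(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for toks in tokenized for w in set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        # Common words get a low IDF, so the TF * IDF product filters them out.
        scores = {w: (tf[w] / len(toks)) * math.log(n_docs / (1 + df[w]))
                  for w in tf}
        results.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return results
```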
Step three, data enhancement
The sentences containing the high-frequency words extracted in step two are enhanced. The enhancement method is: copy the paragraph containing the high-frequency word and insert the copy at a random position in the pure unsupervised corpus. An enhancement amplitude must be set in this step; if the pure unsupervised corpus obtained in step one is small, the amplitude is set so that high-frequency-word paragraphs are copied 3-5 times, which yields a better language model.
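The helper below is one straightforward reading of this augmentation rule; it is hypothetical, treats the corpus as a list of paragraphs, and defaults the amplitude to 3, the low end of the 3-5x range suggested above:

```python
import random

def augment(paragraphs: list[str], high_freq: set[str],
            amplitude: int = 3) -> list[str]:
    """Copy each paragraph containing a high-frequency word `amplitude`
    times, inserting each copy at a random position in the corpus."""
    out = list(paragraphs)
    for para in paragraphs:
        if any(w in para for w in high_freq):
            for _ in range(amplitude):
                out.insert(random.randrange(len(out) + 1), para)
    return out
```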
Step four, training the language model
Language model pre-training uses XLNet, an autoregressive pre-training method. In this step, the XLNet model is used to model the pure unsupervised corpus enhanced in step three, generating the domain-specific corpus model.
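A minimal pre-training sketch follows, assuming the HuggingFace transformers and datasets libraries; the patent prescribes no framework, so the file names (domain_spm.model, augmented_corpus.txt) and hyperparameters are illustrative assumptions. DataCollatorForPermutationLanguageModeling implements XLNet's permutation language modeling objective:

```python
from transformers import (
    DataCollatorForPermutationLanguageModeling,
    Trainer, TrainingArguments,
    XLNetConfig, XLNetLMHeadModel, XLNetTokenizer,
)
from datasets import load_dataset  # pip install datasets sentencepiece

# Hypothetical SentencePiece model trained on the augmented domain corpus.
tokenizer = XLNetTokenizer("domain_spm.model")
model = XLNetLMHeadModel(XLNetConfig(vocab_size=tokenizer.vocab_size))

dataset = load_dataset("text", data_files={"train": "augmented_corpus.txt"})
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True,
                         padding="max_length", max_length=512),
    batched=True, remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments("xlnet-domain", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset["train"],
    # Masks token spans and predicts them autoregressively under a
    # random factorization order (the even max_length matters: this
    # collator requires even sequence lengths).
    data_collator=DataCollatorForPermutationLanguageModeling(tokenizer=tokenizer),
).train()
```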
XLNet's autoregressive (AR) language model uses context words to predict the next word, pre-training the language model in both the forward and backward directions. XLNet also adopts an attention-mask mechanism, randomly masking out some words inside the Transformer, and adds the Transformer-XL mechanism, which lets XLNet address the discrepancy between pre-training and fine-tuning that arises when modeling long text. Once language model training is complete, the corpus can be segmented better using the word segmentation model generated from the model, which improves the accuracy of TF-IDF; training is then iterated again to improve the language model.
In XLNet's autoregressive pre-training, the forward pass predicts the current word from the words that precede it in the sentence, and the backward pass predicts it from the words that follow. For a given text sequence, the AR model predicts each token from its preceding or following context, and the probabilities of all time steps are finally multiplied to form the model's objective function, written out below. On top of the Transformer-XL mechanism, XLNet also introduces random factorization orders and a two-stream attention mechanism, which reduces the training resources consumed when fine-tuning downstream NLP tasks.
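For reference, the forward factorization just described is the standard autoregressive objective (the backward direction conditions on the following tokens instead), and XLNet generalizes it by taking an expectation over random factorization orders; this notation follows the XLNet paper rather than the patent:

```latex
% Forward autoregressive objective: the product of per-step
% probabilities, maximized in log form.
\max_{\theta}\ \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid \mathbf{x}_{<t}\right)

% XLNet's permutation language modeling objective, where \mathcal{Z}_T
% is the set of all permutations of [1, ..., T]:
\max_{\theta}\ \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```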
Correspondingly, this embodiment provides a computer device comprising a memory and a processor, the memory storing a computer program; when the processor executes the computer program, it implements the steps of the above domain-specific corpus model construction method.
In addition, this embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above domain-specific corpus model construction method.
Example 2
This example is based on example 1:
assume that a media user has used a crawler to collect a large number of stock-related files and wants a classification model to determine which stock each document concerns and to analyze whether it reports a profit or a loss.
Correspondingly, this embodiment provides a domain-specific corpus model construction method, as shown in fig. 1, comprising the following steps:
step 1, parsing all files, extracting the plain text from the PDFs, and cleaning and preprocessing the resulting text;
step 2, performing term frequency and inverse document frequency analysis on the corpus with a TF-IDF statistical model to obtain financial-domain terms and the high-frequency words of specific texts;
step 3, locating the paragraphs containing the high-frequency words from step 2 and copying each of them to arbitrary positions in the text 2 times;
step 4, pre-training a language model using the corpus generated in step 3 as the input corpus for XLNet;
and step 5, if the accuracy of the language model from step 4 on the downstream task is not clearly better than that of the general-corpus language model, using the model generated in step 4 to fine-tune the word segmentation model, repeating the TF-IDF analysis of step 2 and the data enhancement, and completing iterative training to generate a new language model; this loop is sketched below.
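The iteration in step 5 can be pictured as the following driver loop, reusing the hypothetical helpers sketched in example 1; pretrain_xlnet, downstream_accuracy and refit_segmenter are assumed stand-ins for the training, evaluation and segmenter fine-tuning stages, not interfaces defined by the patent:

```python
def build_domain_model(paragraphs, baseline_acc, max_rounds=3):
    model = None
    for _ in range(max_rounds):
        per_doc = high_freq_words(paragraphs)             # step 2: TF-IDF
        hf = {w for doc in per_doc for w in doc}
        corpus = augment(paragraphs, hf, amplitude=2)     # step 3: copy 2x
        model = pretrain_xlnet(corpus)                    # step 4: XLNet
        if downstream_accuracy(model) > baseline_acc:     # step 5: compare
            break
        # Otherwise use the new model to refine segmentation, which
        # improves TF-IDF on the next round.
        refit_segmenter(model)
    return model
```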
The foregoing is illustrative of the preferred embodiments of this invention. The invention is not limited to the precise forms disclosed herein; various other combinations, modifications and environments falling within the scope of the concept disclosed here, whether described above or apparent to those skilled in the relevant art, may be resorted to, and modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for constructing a domain-specific corpus model, characterized by comprising the following steps:
step one, corpus collection and preprocessing: obtaining a sufficient pure unsupervised corpus through data cleaning;
step two, term frequency and inverse document frequency analysis: identifying words of higher importance in the pure unsupervised corpus with the TF-IDF statistic, removing common words using the inverse document frequency component of TF-IDF, and taking the higher-frequency words among those remaining as high-frequency words of the current text or of the domain-specific corpus;
step three, data enhancement: enhancing the sentences in which the high-frequency words extracted in step two appear, the enhancement method being: copying the paragraph containing the high-frequency word and inserting the copy at a random position in the pure unsupervised corpus;
step four, language model training: modeling the pure unsupervised corpus enhanced in step three with an XLNet model to generate the domain-specific corpus model; once the domain-specific corpus model is trained, the corpus is re-segmented with the word segmentation model it generates, and training is iterated again to improve the language model.
2. The domain-specific corpus model construction method according to claim 1, wherein in step one the data cleaning comprises parsing and extracting the text in a large number of PDF files, the parsing comprising:
keeping the text content continuous and dividing it by paragraph, so that the context within a paragraph is coherent;
treating the document title as a separate paragraph, and each section title in the body as a separate paragraph, to keep consecutive sentences continuous.
3. The method according to claim 2, wherein for content laid out in two or more columns, if the directly read content is semantically coherent it is kept as pure corpus, and otherwise it is discarded.
4. The method according to claim 2, wherein the parsing further comprises:
normalizing the text between traditional and simplified Chinese characters, converting all traditional characters to simplified ones;
removing author information, table-of-contents content, pictures, charts, tables, headers and footers.
5. The domain-specific corpus model construction method according to claim 1, wherein in step two the frequency of each word within the current text (its term frequency) is calculated, and each word's occurrence across all texts is used to derive its inverse document frequency; common words are finally filtered out by the product of the two, and the words important to each document are retained.
6. The method according to claim 1, wherein in step three an enhancement amplitude is set, and if the pure unsupervised corpus obtained in step one is small, the enhancement amplitude is set so that the paragraph containing each high-frequency word is copied 3-5 times.
7. The method according to claim 1, wherein in step four context words are used to predict the next word, and the domain-specific corpus model is pre-trained in both the forward and backward directions.
8. The method according to claim 7, wherein for a given text sequence each token is predicted from its preceding or following context, and the probabilities of all time steps are multiplied together to serve as the objective function of the model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the method according to any of claims 1-8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202011589591.2A 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium Active CN112612892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011589591.2A CN112612892B (en) 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011589591.2A CN112612892B (en) 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112612892A 2021-04-06
CN112612892B CN112612892B (en) 2022-11-01

Family

ID=75248796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011589591.2A Active CN112612892B (en) 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112612892B (en)



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10559009B1 (en) * 2013-03-15 2020-02-11 Semcasting, Inc. System and method for linking qualified audiences with relevant media advertising through IP media zones
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108417210A (en) * 2018-01-10 2018-08-17 苏州思必驰信息科技有限公司 A kind of word insertion language model training method, words recognition method and system
US20200042508A1 (en) * 2018-08-06 2020-02-06 Walmart Apollo, Llc Artificial intelligence system and method for auto-naming customer tree nodes in a data structure
CN109189925A (en) * 2018-08-16 2019-01-11 华南师范大学 Term vector model based on mutual information and based on the file classification method of CNN
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 Merge the keyword abstraction method of subject information and two-way LSTM
CN110096705A (en) * 2019-04-29 2019-08-06 扬州大学 A kind of unsupervised english sentence simplifies algorithm automatically
CN110705291A (en) * 2019-10-10 2020-01-17 青岛科技大学 Word segmentation method and system for documents in ideological and political education field based on unsupervised learning
CN110851176A (en) * 2019-10-22 2020-02-28 天津大学 Clone code detection method capable of automatically constructing and utilizing pseudo clone corpus
CN111027327A (en) * 2019-10-29 2020-04-17 平安科技(深圳)有限公司 Machine reading understanding method, device, storage medium and device
CN110956021A (en) * 2019-11-14 2020-04-03 微民保险代理有限公司 Original article generation method, device, system and server
CN111767741A (en) * 2020-06-30 2020-10-13 福建农林大学 Text emotion analysis method based on deep learning and TFIDF algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG Yubo et al., "Chinese Word Vectors Based on the Related Concept Fields of HowNet (知网)", Journal of Chinese Information Processing (中文信息学报) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4141733A1 (en) * 2021-08-26 2023-03-01 Beijing Baidu Netcom Science And Technology Co. Ltd. Model training method and apparatus, electronic device, and storage medium
CN113657122A (en) * 2021-09-07 2021-11-16 内蒙古工业大学 Mongolian Chinese machine translation method of pseudo-parallel corpus fused with transfer learning
CN113657122B (en) * 2021-09-07 2023-12-15 内蒙古工业大学 Mongolian machine translation method of pseudo parallel corpus integrating transfer learning
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server

Also Published As

Publication number Publication date
CN112612892B (en) 2022-11-01

Similar Documents

Publication Publication Date Title
Albalawi et al. Using topic modeling methods for short-text data: A comparative analysis
US11379668B2 (en) Topic models with sentiment priors based on distributed representations
CN112612892B (en) Special field corpus model construction method, computer equipment and storage medium
Tabassum et al. A survey on text pre-processing & feature extraction techniques in natural language processing
Nagamanjula et al. A novel framework based on bi-objective optimization and LAN2FIS for Twitter sentiment analysis
Atzeni et al. Using frame-based resources for sentiment analysis within the financial domain
Chouigui et al. An arabic multi-source news corpus: experimenting on single-document extractive summarization
Singh et al. Youtube comments sentiment analysis
Gharavi et al. Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase
Nasim et al. Cluster analysis of urdu tweets
AL-Jumaili A hybrid method of linguistic and statistical features for Arabic sentiment analysis
Dwivedi et al. Sentiment analytics for crypto pre and post covid: topic modeling
Pirovani et al. Studying the adaptation of Portuguese NER for different textual genres
Haider et al. Corporate news classification and valence prediction: A supervised approach
Sarwar et al. Author verification of nahj al-balagha
Gao et al. Detecting comments showing risk for suicide in YouTube
Hamada et al. Sentimental text processing tool for Russian language based on machine learning algorithms
Elarnaoty et al. Machine learning implementations in arabic text classification
Vidyavihar Sentiment analysis in Marathi language
Yadlapalli et al. Advanced Twitter sentiment analysis using supervised techniques and minimalistic features
Elagamy et al. Text mining approach to analyse stock market movement
Goel A study of text mining techniques: Applications and Issues
Sarwar et al. AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model
Umidjon UNLOCKING THE POWER OF NATURAL LANGUAGE PROCESSING (NLP) FOR TEXT ANALYSIS
CN117291192B (en) Government affair text semantic understanding analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant