CN112765957A - Punctuation-free text clause dividing method - Google Patents
- Publication number: CN112765957A
- Application number: CN202110220106.2A
- Authority: CN (China)
- Prior art keywords: vector, text, model, data, free text
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216: Parsing using statistical methods
- G06F40/30: Semantic analysis
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Learning methods
Abstract
The invention relates to the technical field of artificial intelligence within computing, and in particular to a punctuation-free text clause dividing method comprising the following steps. S1: preprocess the Chinese text data: extract the text data to be processed from a database, determine the identifier categories used to segment the text, clean the data, and remove special characters from the text. S2: label the processed data: add a label after each character; each character in a single corpus occupies one line, with a space separating the character from its label. S3: feed the labeled text into a BERT Chinese pre-trained language model, converting the Chinese characters into Embedding-space word vectors. S4: feed the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, map h to a vector whose dimension equals the number of labels. S5: pass the result generated in step S4 through a CRF layer to output the highest-scoring label sequence, and iterate the above training steps until the model converges.
Description
Technical Field
The invention relates to the technical field of artificial intelligence within computing, and in particular to a punctuation-free text clause dividing method.
Background
With the rise of the internet, the world has rapidly entered the information age; more and more people use the internet, and chatting, shopping and the like have become the norm. China has nearly a billion netizens, many with limited education, and when they leave messages on news media and other websites they produce large amounts of message content without any dividing marks, which imposes great time and labor costs on auditors. To reduce those costs, it is important to segment such message content intelligently using computer technology.
Existing research and practice in this field includes clause segmentation using dilated convolutional networks and recurrent neural networks. However, a dilated convolutional network struggles to capture the contextual semantic information of the text and easily loses information, while a recurrent neural network is slow to train and prone to problems such as vanishing gradients, which increases the difficulty of model training and reduces prediction accuracy.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a punctuation-free text clause dividing method. It overcomes the drawbacks of slow training, vanishing network gradients and missing context semantics, and offers a novel intelligent clause-segmentation solution for Chinese text lacking clause marks, making review easier for reviewers, improving review efficiency, and reducing time and labor costs.
The technical scheme adopted by the invention is as follows: a punctuation-free text clause dividing method comprises the following steps:
S1: preprocess the Chinese text data: extract the text data to be processed from a database, determine the identifier categories used to segment the text, clean the data, and remove special characters from the text;
S2: label the processed data: add a label after each character; each character in a single corpus occupies one line, with a space separating the character from its label;
S3: feed the labeled text into a BERT Chinese pre-trained language model, converting the Chinese characters into Embedding-space word vectors;
S4: feed the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, map h to a vector whose dimension equals the number of labels;
S5: pass the result generated in step S4 through a CRF layer to output the highest-scoring label sequence, and iterate the above training steps until the model converges.
Preferably, steps S4 and S5 use the optimization objective function of the BiLSTM + CRF structure given in the detailed description.
Preferably, in step S5, the model is trained with different parameter settings, each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
Preferably, the special characters in the text described in step S1 include punctuation marks, spaces, line feeds, and the like.
Preferably, the Embedding-space word vector in step S3 is the element-wise sum of three embedded features: the word (token) vector, the sentence (segment) vector, and the position vector.
Preferably, the vector h in step S4 is the concatenation of the preceding-context vector h-before and the following-context vector h-after.
Preferably, the parameters mainly include the learning rate and batch_size.
Preferably, the method further comprises step S6: prediction. The sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is aligned with each position of the input sequence, and separators are added at the corresponding positions according to the punctuation label types, completing automatic clause division of the input sequence.
The invention has the following beneficial effects:
1. training is efficient and convenient, saving training time;
2. the method is more accurate, avoiding vanishing network gradients and missing context semantics;
3. it improves staff work efficiency and reduces time and labor costs.
Drawings
FIG. 1 is a diagram of the data annotation process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of completed data annotation according to an embodiment of the present invention;
FIG. 3 is a block diagram of the BERT language model of the present invention;
FIG. 4 is a network architecture diagram of the BiLSTM of the present invention.
Detailed Description
The invention is described in detail below with reference to the following figures and detailed description:
BERT: Bidirectional Encoder Representations from Transformers, a natural-language-processing pre-trained model open-sourced by Google;
Data cleaning: a process of re-examining and verifying data, with the purpose of deleting duplicate information, correcting existing errors, and providing data consistency;
Data annotation: the processing of artificial-intelligence training data by a data annotator using an annotation tool or program code;
BiLSTM: Bidirectional Long Short-Term Memory network, a special form of RNN with the ability to learn long-term dependencies;
CRF: conditional random field, a probabilistic graphical model obeying the Markov property;
Viterbi algorithm: an optimal-path search algorithm.
Example:
Text data to be processed is extracted from a database, and the identifier categories for segmenting the text are determined. The method uses comma, period, question mark, colon, exclamation mark and semicolon identifiers, plus sentence-beginning and sentence-ending identifiers.
The data is cleaned to remove special characters (such as punctuation marks, spaces and line-feed characters) from the text, and the cleaned text data is then labeled. Each character in a single corpus occupies one line, with a space separating the character from its label. The beginning of the text is marked BEG. The character in front of a punctuation mark is labeled B plus the punctuation category: for a comma ("comma" in English), the first three letters are combined with B to form B_com; similarly, the character after the punctuation mark is combined with E to form E_com. If the text ends with a question it is marked QUE, exclamations use EXC, and periods use EEG. The labeling procedure for the sentence "你好，吃饭了吗" ("Hello, have you eaten?") is shown in FIG. 1. The result is saved in txt file format as shown in FIG. 2, and the data is split into training and test sets in an 8:2 ratio: 80% as the training data set and 20% as the test data set.
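The labeling procedure above can be sketched in a few lines of Python. This is a hedged illustration, not the patent's code: the helper name `label_text`, the use of "O" for unlabeled characters, and the exact punctuation-to-suffix table are assumptions.

```python
# Sketch of the character-level labeling scheme described above.
# Assumptions (not the patent's code): the helper name `label_text`, the use
# of "O" for unlabeled characters, and the punctuation-to-suffix table below.
PUNCT_TAGS = {",": "com", "?": "que", "!": "exc"}  # comma / question / exclamation

def label_text(text):
    """Return (char, label) pairs; punctuation marks are consumed into labels."""
    chars = list(text)
    pairs = []
    first = True
    for i, c in enumerate(chars):
        if c in PUNCT_TAGS:          # the mark itself does not appear in output
            continue
        label = "O"
        if first:
            label, first = "BEG", False
        if i + 1 < len(chars) and chars[i + 1] in PUNCT_TAGS:
            label = "B_" + PUNCT_TAGS[chars[i + 1]]   # char before the mark
        elif i > 0 and chars[i - 1] in PUNCT_TAGS:
            label = "E_" + PUNCT_TAGS[chars[i - 1]]   # char after the mark
        pairs.append((c, label))
    return pairs

# One character per line, separated from its label by a space:
corpus_lines = ["%s %s" % (c, t) for c, t in label_text("你好,吃饭了吗?")]
```

Writing `corpus_lines` out, one list element per line, gives the txt corpus format the text describes.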
The labeled text is fed into the BERT Chinese pre-trained language model, which converts the Chinese characters into Embedding-space word vectors, as shown in FIG. 3.
The BERT embedding encoding vector is the element-wise sum of three embedded features: the word (token) vector, the sentence (segment) vector and the position vector. It serves as the input vector for the BiLSTM.
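The summing of the three embedding features can be shown with a toy example. The 4-dimensional numbers below are invented purely to show the arithmetic; real BERT-base uses learned 768-dimensional vectors.

```python
# Toy illustration of the BERT input embedding: the element-wise sum of the
# token (word) vector, segment (sentence) vector and position vector.
# Real BERT-base uses learned 768-dimensional vectors; these 4-d numbers
# are invented purely to show the arithmetic.
def bert_input_embedding(token_vec, segment_vec, position_vec):
    assert len(token_vec) == len(segment_vec) == len(position_vec)
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]

emb = bert_input_embedding([0.1, 0.2, 0.3, 0.4],   # token embedding
                           [0.0, 0.0, 0.0, 1.0],   # segment embedding
                           [0.5, 0.4, 0.3, 0.2])   # position embedding
# emb is approximately [0.6, 0.6, 0.6, 1.6]
```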
The extracted Embedding-space vectors are fed into a bidirectional long short-term memory network to obtain a vector h containing context information, where h is the concatenation of the preceding-context vector h-before and the following-context vector h-after. After a dropout layer, h is mapped to a vector whose dimension equals the number of labels. The network structure of the BiLSTM is shown in FIG. 4:
The generated result is passed through a CRF layer to output the highest-scoring label sequence, and the above steps are trained and iterated until the model converges.
To optimize the model's effectiveness, the model is trained with different parameter settings (such as learning rate, batch size, number of LSTM units, and the like); each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
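Model selection by highest F1 might look like the following sketch. The patent does not specify its averaging scheme, so this version scores only non-"O" (punctuation) labels, micro-averaged over positions, and is purely illustrative.

```python
# Illustrative F1 computation for model selection. The patent does not give
# its averaging scheme, so this sketch scores only non-"O" (punctuation)
# labels, micro-averaged over positions.
def f1_score(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and p != g)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical selection loop (names `models`, `predict` are assumptions):
# best = max(models, key=lambda m: f1_score(gold_labels, m.predict(test_set)))
```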
Test comparison: data from a real scenario (user messages and comment data) is used as the model's training and test data. Based on the distribution of real sample lengths and the performance of the machine's GPU (Tesla P100, 12 GB of video memory), the maximum input sequence length max_seq_length is set to 192. The total data volume is 200,000 samples; the Transformer has 12 layers; the feature-layer dimensionality is 768; the optimizer is Adam; dropout is set to 0.5 and clip to 0.5. The remaining parameter settings, iteration counts and F1 evaluation results are as follows:
the parameter adjustment mainly aims at the learning rate and the batch _ size, the learning rate directly influences the convergence state of the model, and the batch _ size influences the generalization ability of the model, and the parameters influence the performance of the model most importantly. The first training learning rate was set to 10-5, the number of batches was 36 per batch, the LSTM unit size was 64, the number of iterations and the F1 values were as follows:
The parameters are adjusted by increasing batch_size, with everything else unchanged:
The learning rate is increased, batch_size is set to 36, and the other parameters are unchanged:
batch_size is changed while the learning rate remains 1e-4:
Repeated tests show that a larger learning rate and a larger batch size make the model more stable and give it stronger generalization ability.
Prediction: the sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is aligned with each position of the input sequence, and separators are added at the corresponding positions according to the punctuation label types, completing automatic clause division of the input sequence.
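Viterbi decoding over the CRF's transition scores and the BiLSTM's emission scores can be sketched as follows. The 2-tag score values in the usage example are invented for illustration; real decoding runs over the full tag set.

```python
# Viterbi decoding over CRF scores, as in the prediction step.
# A[i][j] is the transition score from tag i to tag j; P[t][j] is the
# emission score of tag j at position t.
def viterbi(P, A):
    n, k = len(P), len(P[0])
    score = [P[0][:]]                      # best score ending in tag j at step 0
    back = []                              # backpointers
    for t in range(1, n):
        row, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[-1][i] + A[i][j])
            row.append(score[-1][best_i] + A[best_i][j] + P[t][j])
            ptr.append(best_i)
        score.append(row)
        back.append(ptr)
    j = max(range(k), key=lambda j: score[-1][j])
    path = [j]
    for ptr in reversed(back):             # trace the best path backwards
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# With strong anti-switching transitions, the weak middle emission is overruled:
best_path = viterbi([[1, 0], [0, 0.5], [1, 0]], [[0, -10], [-10, 0]])
# best_path == [0, 0, 0]
```

The decoded tag indices are then aligned with the input characters, and a separator is inserted wherever the tag denotes a punctuation category.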
The network of the invention consists of three parts: a BERT encoding network, which encodes characters into an Embedding tensor; a BiLSTM network, through which the tensor passes to produce vectors containing context information; and a CRF layer, through which the vectors pass to output the highest-scoring label sequence.
The optimization objective function of the BiLSTM + CRF structure is:

$$\text{loss} = \log \sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big) - s(X, y) \tag{1}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the input sequence and $y = (y_1, y_2, \ldots, y_n)$ is the label sequence corresponding to the input sequence. The score of assigning labels $y$ to $X$ is defined by formula (2):

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{2}$$

where $A_{i,j}$ is the transition score matrix, giving the score of transitioning from label $i$ to label $j$; $y_0$ and $y_{n+1}$ are the start and end labels of the sentence; $P$ is the score matrix output by the BiLSTM; and $P_{i, y_i}$ is the score of the $i$-th word in the input sequence taking label $y_i$.

Given an input sequence $X$, the probability of a label sequence $y$ is:

$$P(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big)} \tag{6}$$

where $Y_X$ is the set of all possible label sequences for the input sequence $X$; each candidate label sequence for a sentence has a score and a probability. The goal is to maximize the probability that the input sequence $X$ corresponds to the true labels $y$, i.e. to minimize the loss function. Taking the negative logarithm of both sides of formula (6) yields formula (1).
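The sequence score and probability above can be checked with a brute-force sketch that enumerates all tag sequences. This is feasible only at toy sizes (real training computes the partition sum with the forward algorithm), and the start/end tags $y_0$, $y_{n+1}$ are omitted for brevity.

```python
import math
from itertools import product

# Brute-force sketch of the CRF scoring: s(X, y) sums transition scores
# A[y_i][y_{i+1}] and emission scores P[i][y_i]; the probability of y
# normalises exp(s) over every possible tag sequence. Start/end tags are
# omitted for brevity; enumeration is only feasible at toy sizes.
def sequence_score(P, A, y):
    s = sum(P[i][y[i]] for i in range(len(y)))
    s += sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return s

def sequence_prob(P, A, y):
    n, k = len(P), len(P[0])
    Z = sum(math.exp(sequence_score(P, A, yt))
            for yt in product(range(k), repeat=n))
    return math.exp(sequence_score(P, A, y)) / Z
```

With all scores zero, every one of the $k^n$ tag sequences is equally likely, and the probabilities always sum to 1, matching the normalisation in formula (6).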
Claims (8)
1. A punctuation-free text clause dividing method, characterized by comprising the following steps:
S1: preprocess the Chinese text data: extract the text data to be processed from a database, determine the identifier categories used to segment the text, clean the data, and remove special characters from the text;
S2: label the processed data: add a label after each character; each character in a single corpus occupies one line, with a space separating the character from its label;
S3: feed the labeled text into a BERT Chinese pre-trained language model, converting the Chinese characters into Embedding-space word vectors;
S4: feed the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, map h to a vector whose dimension equals the number of labels;
S5: pass the result generated in step S4 through a CRF layer to output the highest-scoring label sequence, and iterate the above training steps until the model converges.
3. The punctuation-free text clause dividing method according to claim 1, characterized in that: in step S5, to optimize the model's effectiveness, different parameters are set for model training, each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
4. The punctuation-free text clause dividing method according to claim 1, characterized in that: the special characters in the text described in step S1 include punctuation marks, spaces, line feeds, and the like.
5. The punctuation-free text clause dividing method according to claim 1, characterized in that: the Embedding-space word vector in step S3 is the element-wise sum of three embedded features: the word (token) vector, the sentence (segment) vector and the position vector.
6. The punctuation-free text clause dividing method according to claim 1, characterized in that: the vector h in step S4 is the concatenation of the preceding-context vector h-before and the following-context vector h-after.
7. The punctuation-free text clause dividing method according to claim 3, characterized in that: the parameters mainly include the learning rate and batch_size.
8. The punctuation-free text clause dividing method according to any one of claims 1 to 7, characterized by further comprising step S6: prediction. The sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is aligned with each position of the input sequence, and separators are added at the corresponding positions according to the punctuation label types, completing automatic clause division of the input sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110220106.2A CN112765957A (en) | 2021-02-27 | 2021-02-27 | Punctuation-free text clause dividing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112765957A true CN112765957A (en) | 2021-05-07 |
Family
ID=75704294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110220106.2A Pending CN112765957A (en) | 2021-02-27 | 2021-02-27 | Punctuation-free text clause dividing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765957A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN108932226A * | 2018-05-29 | 2018-12-04 | 华东师范大学 | A method for adding punctuation marks to punctuation-free text
- CN110413785A * | 2019-07-25 | 2019-11-05 | 淮阴工学院 | An automatic document classification method based on BERT and feature fusion
- CN111241837A * | 2020-01-04 | 2020-06-05 | 大连理工大学 | A named-entity recognition method for theft-case legal documents based on adversarial transfer learning
- CN111339750A * | 2020-02-24 | 2020-06-26 | 网经科技(苏州)有限公司 | A spoken-text processing method that removes stop words and predicts sentence boundaries
- CN111680511A * | 2020-04-21 | 2020-09-18 | 华东师范大学 | A military-domain named-entity recognition method with multiple cooperating neural networks
- CN112287640A * | 2020-11-02 | 2021-01-29 | 杭州师范大学 | A sequence labeling method based on Chinese character structure
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | Application publication date: 20210507
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication |