CN112765957A - Punctuation-free text clause dividing method - Google Patents

Info

Publication number
CN112765957A
Authority
CN
China
Prior art keywords
vector
text
model
data
free text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110220106.2A
Other languages
Chinese (zh)
Inventor
邓强
朱西华
孙力泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Wanwei Information Technology Co Ltd
Original Assignee
China Telecom Wanwei Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Wanwei Information Technology Co Ltd filed Critical China Telecom Wanwei Information Technology Co Ltd
Priority to CN202110220106.2A
Publication of CN112765957A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of artificial intelligence in computer technology, and in particular to a punctuation-free text clause dividing method comprising the following steps. S1: preprocess the Chinese text data: extract the text data to be processed from a database, determine the identifier categories for segmenting the text, clean the data, and remove special characters from the text. S2: label the processed data: a label is appended after each character, each character of a single corpus occupies one line, and a space separates the character from its label. S3: feed the labeled text into a BERT Chinese pre-trained language model, which converts the Chinese characters into Embedding-space word vectors. S4: feed the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information; after a dropout layer, h is mapped to a vector whose dimension equals the number of labels. S5: pass the result generated in step S4 through a CRF layer to output the label sequence with the highest score, and continuously train and iterate the above steps until the model converges.

Description

Punctuation-free text clause dividing method
Technical Field
The invention relates to the field of artificial intelligence in computer technology, and in particular to a punctuation-free text clause dividing method.
Background
With the rise of the internet, the world has rapidly entered the information era; more and more people are online, and chatting, shopping and the like have become the norm. China has nearly a billion netizens, many of whom have limited formal education, and in leaving messages on news-media and other websites they produce a great deal of message content without any dividing marks, which imposes heavy time and labor costs on auditors. To reduce these costs, it is important to parse such message content intelligently with computer technology.
Some research has already been carried out in this technical field, including clause segmentation implemented with dilated convolutional networks and with recurrent neural networks. However, a dilated convolutional network struggles to take the contextual semantic information of the text into account and easily loses information, while a recurrent neural network is time-consuming to train and prone to problems such as vanishing gradients, which increases the difficulty of model training and reduces the accuracy of model prediction.
Disclosure of Invention
To solve the problems in the prior art, the invention aims to provide a punctuation-free text clause dividing method that overcomes the time-consuming training, vanishing network gradients and missing context semantics of existing approaches. It offers a novel, intelligent clause-segmentation solution for Chinese text that carries no clause marks, making review easier for auditors, improving review efficiency, and reducing time and labor costs.
The technical scheme adopted by the invention is as follows: a punctuation-free text clause dividing method comprises the following steps:
S1: preprocessing the Chinese text data: extracting the text data to be processed from a database, determining the identifier categories for segmenting the text, cleaning the data, and removing special characters from the text;
S2: labeling the processed data: a label is appended after each character, each character of a single corpus occupies one line, and a space separates the character from its label;
S3: feeding the labeled text into a BERT Chinese pre-trained language model, which converts the Chinese characters into Embedding-space word vectors;
S4: feeding the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, mapping h to a vector whose dimension equals the number of labels;
S5: passing the result generated in step S4 through a CRF layer to output the label sequence with the highest score, and continuously training and iterating the above steps until the model converges.
Preferably, in steps S4 and S5 the optimization objective function of the BiLSTM + CRF structure is:

\log P(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}
in step S5, different parameters are set to train the model, the trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
The special characters removed from the text in the preferred step S1 include punctuation marks, spaces, line feeds, and the like.
The Embedding-space word vector in step S3 is preferably the element-wise sum of three embedding features: the word vector of the token itself, the sentence vector and the position vector.
The vector h in step S4 is preferably the concatenation of the preceding-context vector h-before and the following-context vector h-after.
The preferred parameters mainly include the learning rate and the batch_size.
Preferably, the method further comprises step S6: prediction, in which the sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is compared with the input sequence position by position, and separators are added at the positions corresponding to punctuation labels, completing automatic clause division of the input sequence.
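By way of illustration only — not the patent's own code — the network of steps S3 to S5 and the prediction of step S6 could be sketched in PyTorch roughly as follows; the class name, sizes, and the use of the third-party transformers and pytorch-crf packages are assumptions, not part of the patent:

    # Hedged sketch, not the patented implementation: BERT -> BiLSTM -> dropout -> linear -> CRF
    import torch
    import torch.nn as nn
    from transformers import BertModel          # assumed: Hugging Face transformers
    from torchcrf import CRF                    # assumed: pytorch-crf package

    class BertBilstmCrf(nn.Module):
        def __init__(self, num_labels, lstm_units=64, dropout=0.5):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-chinese")
            self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_units,
                                batch_first=True, bidirectional=True)
            self.dropout = nn.Dropout(dropout)
            # map h (both directions concatenated) to a vector of label-count dimension
            self.fc = nn.Linear(2 * lstm_units, num_labels)
            self.crf = CRF(num_labels, batch_first=True)

        def forward(self, input_ids, attention_mask, labels=None):
            emb = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            h, _ = self.lstm(emb)                         # h = [h-before ; h-after]
            emissions = self.fc(self.dropout(h))
            mask = attention_mask.bool()
            if labels is not None:                        # training: negative log-likelihood
                return -self.crf(emissions, labels, mask=mask)
            return self.crf.decode(emissions, mask=mask)  # prediction: Viterbi decoding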
The invention has the beneficial effects that:
1. Training is efficient and convenient, saving training time;
2. The method is more accurate, avoiding vanishing network gradients and missing context semantics;
3. Staff work efficiency is improved, reducing time and labor costs.
Drawings
FIG. 1 is a schematic diagram of the data annotation process in an embodiment of the present invention;
FIG. 2 is a schematic diagram of an embodiment of the present invention after data annotation is completed;
FIG. 3 is a block diagram of the BERT language model of the present invention;
FIG. 4 is a network architecture diagram of the BiLSTM of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and a specific embodiment:
BERT: Bidirectional Encoder Representations from Transformers, a natural language processing pre-training model open-sourced by Google;
data cleaning: the process of re-examining and verifying data in order to remove duplicate information, correct existing errors and ensure data consistency;
data annotation: the activity of a data annotator processing data for artificial-intelligence learning by means of an annotation tool or program code;
BiLSTM: Bidirectional Long Short-Term Memory network, a special form of RNN with the ability to learn long-term dependencies;
CRF: conditional random field, a probabilistic graphical model obeying the Markov property;
Viterbi algorithm: a dynamic-programming algorithm for searching an optimal path.
Example:
Taking "你好，吃了吗" ("Hello, have you eaten?") as an example, the content of the invention is described in detail as follows:
text data to be processed is extracted from a database, and identifier categories of the segmented text are determined. The method sets comma, period, question mark, colon mark, exclamation mark, semicolon and sentence beginning and ending identifiers.
The data are cleaned to remove special characters from the text (such as punctuation marks, spaces and line feeds), and the cleaned text data are then annotated. Each character in a single corpus occupies one line, with a space separating the character from its label. The beginning of the text is marked BEG. The character before a punctuation mark is labeled B plus the punctuation category: for a comma (English word "comma"), the first three letters are combined with B to give B_com; similarly, the character after the punctuation mark is combined with E to give E_com. If the text ends with a question mark, the final character is marked QUE; EXC is used for an exclamation mark and EEG for a period. The labeling procedure for "你好，吃了吗" is shown in FIG. 1, and the finished annotation is saved in txt file format as shown in FIG. 2. The data are split in a two-to-eight ratio, with 20% as the test data set and 80% as the training data set.
The labeled text is placed into a BERT Chinese pre-training language model, and the Chinese characters are converted into Embedding space word vectors, as shown in FIG. 3.
The BERT embedding vector is the element-wise sum of three embedding features — the word vector of the token itself, the sentence vector and the position vector — and serves as the input vector for the BiLSTM.
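As a toy illustration of this three-way sum (the vocabulary size 21128 is that of the public bert-base-chinese model and is an assumption here; hidden size 768 and maximum length 192 are taken from the embodiment described later):

    # Hedged toy sketch: BERT input = token vector + sentence vector + position vector
    import torch
    import torch.nn as nn

    vocab_size, max_len, hidden = 21128, 192, 768   # assumed/illustrative sizes
    tok_emb = nn.Embedding(vocab_size, hidden)      # word vector of the token itself
    seg_emb = nn.Embedding(2, hidden)               # sentence (segment) vector
    pos_emb = nn.Embedding(max_len, hidden)         # position vector

    input_ids = torch.randint(0, vocab_size, (1, 10))
    segment_ids = torch.zeros(1, 10, dtype=torch.long)
    positions = torch.arange(10).unsqueeze(0)

    x = tok_emb(input_ids) + seg_emb(segment_ids) + pos_emb(positions)
    print(x.shape)   # torch.Size([1, 10, 768]) -- the BiLSTM input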
The extracted Embedding-space vectors are fed into the bidirectional long short-term memory network to obtain a vector h containing context information, where h is the concatenation of the preceding-context vector h-before and the following-context vector h-after. After a dropout layer, h is mapped to a vector whose dimension equals the number of labels. The network structure of the BiLSTM is shown in FIG. 4:
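The splicing of h-before and h-after can be seen directly in a small sketch (sizes are illustrative; PyTorch concatenates the two directions along the last dimension):

    # Hedged sketch: h is the concatenation of forward and backward context vectors
    import torch
    import torch.nn as nn

    num_labels, lstm_units = 10, 64      # assumed label count; 64 units as in the tests below
    bilstm = nn.LSTM(768, lstm_units, batch_first=True, bidirectional=True)
    proj = nn.Linear(2 * lstm_units, num_labels)
    drop = nn.Dropout(0.5)

    x = torch.randn(1, 10, 768)          # Embedding-space vectors from BERT
    h, _ = bilstm(x)                     # h[..., :64] = h-before, h[..., 64:] = h-after
    logits = proj(drop(h))               # dimension mapped to the number of labels
    print(h.shape, logits.shape)         # (1, 10, 128) and (1, 10, num_labels)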
The generated result passes through a CRF layer, which outputs the label sequence with the highest score; the above steps are trained and iterated continuously until the model converges.
To optimize the model's performance, models are trained with different parameter settings (such as the learning rate, batch size and number of LSTM units), each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
For testing and comparison, data from a real scenario (user messages and comment data) are used as the model's training and test data. Based on the distribution of sample lengths in the real data and the capacity of the machine's GPU (a Tesla P100 with 12 GB of video memory), the maximum input-sequence length max_seq_length is set to 192. The total data volume is 200,000 samples, the Transformer has 12 layers, the feature-layer dimension is 768, the optimizer is Adam, dropout is set to 0.5 and clip to 0.5. The remaining parameter settings, iteration counts and F1 evaluation results are as follows:
the parameter adjustment mainly aims at the learning rate and the batch _ size, the learning rate directly influences the convergence state of the model, and the batch _ size influences the generalization ability of the model, and the parameters influence the performance of the model most importantly. The first training learning rate was set to 10-5, the number of batches was 36 per batch, the LSTM unit size was 64, the number of iterations and the F1 values were as follows:
[Table: iteration counts and F1 values]
The parameters are adjusted: the batch_size is increased, with everything else unchanged:
[Table: iteration counts and F1 values]
The learning rate is increased and the batch_size is set back to 36, with the other parameters unchanged:
[Table: iteration counts and F1 values]
The batch_size is changed while the learning rate remains 10^-4:
[Table: iteration counts and F1 values]
Repeated tests show that a larger learning rate and a larger batch size make the model more stable and give it stronger generalization ability.
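Purely for reference, the setup above might be collected into a configuration dictionary like the following sketch; the key names are assumptions, only the values come from the text:

    # Hedged sketch of the reported hyper-parameter setup
    config = {
        "max_seq_length": 192,        # from sample-length distribution and GPU memory
        "transformer_layers": 12,
        "feature_dim": 768,
        "optimizer": "Adam",
        "dropout": 0.5,
        "clip": 0.5,                  # gradient clipping threshold
        "learning_rate": 1e-5,        # first run; a later run raises it to 1e-4
        "batch_size": 36,             # varied across runs
        "lstm_units": 64,
    }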
Prediction: the sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is compared with the input sequence position by position, and separators are added at the positions corresponding to punctuation labels, completing automatic clause division of the input sequence.
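A hedged sketch of this separator-insertion step, reusing the label scheme from the annotation example above (the label-to-punctuation mapping and the function name are assumptions):

    # Hedged sketch: map the decoded label sequence back to separators
    LABEL_TO_PUNCT = {"B_com": "，", "QUE": "？", "EXC": "！", "EEG": "。"}  # assumed

    def insert_separators(chars, labels):
        out = []
        for ch, lab in zip(chars, labels):
            out.append(ch)
            if lab in LABEL_TO_PUNCT:          # these labels mark a break after the character
                out.append(LABEL_TO_PUNCT[lab])
        return "".join(out)

    print(insert_separators(list("你好吃了吗"),
                            ["BEG", "B_com", "E_com", "O", "QUE"]))  # -> 你好，吃了吗？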
The network of the invention consists of three parts: a BERT encoding network, which encodes the characters into an Embedding tensor; a BiLSTM network, which turns that tensor into vectors containing context information after the BiLSTM layer; and a CRF layer, through which the vectors yield the label sequence with the highest score.
The optimization objective function of the BiLSTM + CRF structure is as follows:

\log P(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}

(1)
wherein:
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

(2)
where X is the input sequence:

X = (x_1, x_2, \ldots, x_n)

(3)
and y is the tag sequence corresponding to the input sequence:

y = (y_1, y_2, \ldots, y_n)

(4)
This defines the score of x → y as Equation (2), where A is the transition score matrix and A_{i,j} represents the score of transitioning from label i to label j, y_0 and y_n are the start and end tags added to the sentence, P is the score matrix output by the BiLSTM, and P_{i, y_i} represents the score of the i-th word of the input sequence taking label y_i.
Given an input sequence X, the probability of a tag sequence Y is obtained:
P(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}

(5)
where Y_X denotes all possible tag sequences for the input sequence X, so that every tag sequence of the sentence has a score and a probability. Our goal is to maximize the probability that the input sequence X corresponds to the true tag sequence y, i.e. to minimize the loss function; the formula is transformed as follows:
\max P(y \mid X) = \max \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}

(6)
Taking the logarithm of both sides of Equation (6) yields Equation (1).
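As a numeric illustration of Equations (1) and (2), consider the brute-force toy below, which omits the start and end tags y_0 and y_n for brevity; all sizes and names are illustrative assumptions:

    # Hedged toy sketch of the CRF score and log-probability (Eqs. 1, 2, 5)
    import itertools
    import torch

    def score(P, A, y):
        """s(X, y): emission scores P[i, y_i] plus transition scores A[y_i, y_{i+1}]."""
        s = P[torch.arange(len(y)), y].sum()
        s = s + sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
        return s

    n, k = 4, 3                        # 4 tokens, 3 labels (toy sizes)
    P = torch.randn(n, k)              # score matrix output by the BiLSTM
    A = torch.randn(k, k)              # transition score matrix
    y = torch.tensor([0, 2, 1, 1])     # one candidate tag sequence

    # log P(y|X) = s(X, y) - log sum over all tag sequences of exp(s)   (Eq. 1)
    all_scores = torch.stack([score(P, A, torch.tensor(seq))
                              for seq in itertools.product(range(k), repeat=n)])
    print(score(P, A, y) - torch.logsumexp(all_scores, dim=0))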

Claims (8)

1. A punctuation-free text clause dividing method, characterized by comprising the following steps:
S1: preprocessing the Chinese text data: extracting the text data to be processed from a database, determining the identifier categories for segmenting the text, cleaning the data, and removing special characters from the text;
S2: labeling the processed data: a label is appended after each character, each character of a single corpus occupies one line, and a space separates the character from its label;
S3: feeding the labeled text into a BERT Chinese pre-trained language model, which converts the Chinese characters into Embedding-space word vectors;
S4: feeding the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, mapping h to a vector whose dimension equals the number of labels;
S5: passing the result generated in step S4 through a CRF layer to output the label sequence with the highest score, and continuously training and iterating the above steps until the model converges.
2. The punctuation-free text clause dividing method according to claim 1, characterized in that: the optimization objective function of the BiLSTM + CRF structure in steps S4 and S5 is:

\log P(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}
3. The punctuation-free text clause dividing method according to claim 1, characterized in that: in step S5, in order to optimize the model's performance, models are trained with different parameter settings, each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
4. The punctuation-free text clause dividing method according to claim 1, characterized in that: the special characters in the text described in step S1 include punctuation marks, spaces, line feeds, and the like.
5. The punctuation-free text clause dividing method according to claim 1, characterized in that: the Embedding-space word vector in step S3 is the element-wise sum of three embedding features: the word vector of the token itself, the sentence vector and the position vector.
6. The punctuation-free text clause dividing method according to claim 1, characterized in that: the vector h in step S4 is the concatenation of the preceding-context vector h-before and the following-context vector h-after.
7. The punctuation-free text clause dividing method according to claim 3, characterized in that: the parameters mainly comprise the learning rate and the batch_size.
8. The punctuation-free text clause dividing method according to any one of claims 1 to 7, characterized in that it further comprises step S6: prediction, in which the sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is compared with the input sequence position by position, and separators are added at the positions corresponding to punctuation labels, completing automatic clause division of the input sequence.
CN202110220106.2A 2021-02-27 2021-02-27 Punctuation-free text clause dividing method Pending CN112765957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110220106.2A CN112765957A (en) 2021-02-27 2021-02-27 Punctuation-free text clause dividing method


Publications (1)

Publication Number Publication Date
CN112765957A true CN112765957A (en) 2021-05-07

Family

ID=75704294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110220106.2A Pending CN112765957A (en) 2021-02-27 2021-02-27 Punctuation-free text clause dividing method

Country Status (1)

Country Link
CN (1) CN112765957A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932226A * 2018-05-29 2018-12-04 华东师范大学 A method for adding punctuation marks to punctuation-free text
CN110413785A * 2019-07-25 2019-11-05 淮阴工学院 An automatic document classification method based on BERT and feature fusion
CN111241837A * 2020-01-04 2020-06-05 大连理工大学 Named entity recognition method for theft-case legal documents based on adversarial transfer learning
CN111339750A * 2020-02-24 2020-06-26 网经科技(苏州)有限公司 Spoken-language text processing method for removing stop words and predicting sentence boundaries
CN111680511A * 2020-04-21 2020-09-18 华东师范大学 Military-domain named entity recognition method with multiple cooperating neural networks
CN112287640A * 2020-11-02 2021-01-29 杭州师范大学 Sequence labeling method based on Chinese character structure



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210507