CN112765957A - Punctuation-free text clause dividing method - Google Patents
- Publication number: CN112765957A
- Application number: CN202110220106.2A
- Authority: CN (China)
- Prior art keywords: vector, text, model, data, free text
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216: Parsing using statistical methods
- G06F40/30: Semantic analysis
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Learning methods
Abstract
The invention relates to the technical field of artificial intelligence within computing, and in particular to a punctuation-free text clause dividing method comprising the following steps. S1: preprocess the Chinese text data: extract the text data to be processed from a database, determine the identifier categories used to segment the text, clean the data, and remove special characters from the text. S2: label the processed data: add a label after each character; each character in a single corpus occupies one line, with a space separating the character from its label. S3: feed the labeled text into a BERT Chinese pre-trained language model, converting the Chinese characters into Embedding-space word vectors. S4: feed the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, map h to a vector whose dimension equals the number of labels. S5: pass the result generated in step S4 through a CRF layer to output the highest-scoring label sequence, and iterate the above training steps until the model converges.
Description
Technical Field
The invention relates to the technical field of artificial intelligence within computing, and in particular to a punctuation-free text clause dividing method.
Background
With the rise of the internet, the world has rapidly entered the information age; more and more people use the internet, and chatting, shopping and the like have become the norm. China has nearly a billion netizens, many with limited education, and when they leave messages on news media and other websites they produce large amounts of message content without any dividing marks, which imposes great time and labor costs on auditors. To reduce those costs, it is important to segment such message content intelligently using computer technology.
Existing research and practice in this field includes clause segmentation using dilated convolutional networks and recurrent neural networks. However, a dilated convolutional network struggles to capture the contextual semantic information of the text and easily loses information, while a recurrent neural network is slow to train and prone to problems such as vanishing gradients, which increases the difficulty of model training and reduces prediction accuracy.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a punctuation-free text clause dividing method. It overcomes the drawbacks of slow training, vanishing network gradients and missing context semantics, and offers a novel intelligent clause-segmentation solution for Chinese text lacking clause marks, making review easier for reviewers, improving review efficiency, and reducing time and labor costs.
The technical scheme adopted by the invention is as follows: a punctuation-free text clause dividing method comprises the following steps:
S1: preprocess the Chinese text data: extract the text data to be processed from a database, determine the identifier categories used to segment the text, clean the data, and remove special characters from the text;
S2: label the processed data: add a label after each character; each character in a single corpus occupies one line, with a space separating the character from its label;
S3: feed the labeled text into a BERT Chinese pre-trained language model, converting the Chinese characters into Embedding-space word vectors;
S4: feed the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, map h to a vector whose dimension equals the number of labels;
S5: pass the result generated in step S4 through a CRF layer to output the highest-scoring label sequence, and iterate the above training steps until the model converges.
Preferably, steps S4 and S5 use the optimization objective function of the BiLSTM + CRF structure given in the detailed description.
Preferably, in step S5, the model is trained with different parameter settings, each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
Preferably, the special characters in the text described in step S1 include punctuation marks, spaces, line feeds, and the like.
Preferably, the Embedding-space word vector in step S3 is the element-wise sum of three embedded features: the word (token) vector, the sentence (segment) vector, and the position vector.
Preferably, the vector h in step S4 is the concatenation of the preceding-context vector h-before and the following-context vector h-after.
Preferably, the parameters mainly include the learning rate and batch_size.
Preferably, the method further comprises step S6: prediction. The sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is aligned with each position of the input sequence, and separators are added at the corresponding positions according to the punctuation label types, completing automatic clause division of the input sequence.
The invention has the following beneficial effects:
1. training is efficient and convenient, saving training time;
2. the method is more accurate, avoiding vanishing network gradients and missing context semantics;
3. it improves staff work efficiency and reduces time and labor costs.
Drawings
FIG. 1 is a diagram of the data annotation process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of completed data annotation according to an embodiment of the present invention;
FIG. 3 is a block diagram of the BERT language model of the present invention;
FIG. 4 is a network architecture diagram of the BiLSTM of the present invention.
Detailed Description
The invention is described in detail below with reference to the following figures and detailed description:
BERT: Bidirectional Encoder Representations from Transformers, a natural-language-processing pre-trained model open-sourced by Google;
Data cleaning: a process of re-examining and verifying data, with the purpose of deleting duplicate information, correcting existing errors, and providing data consistency;
Data annotation: the processing of artificial-intelligence training data by a data annotator using an annotation tool or program code;
BiLSTM: Bidirectional Long Short-Term Memory network, a special form of RNN with the ability to learn long-term dependencies;
CRF: conditional random field, a probabilistic graphical model obeying the Markov property;
Viterbi algorithm: an optimal-path search algorithm.
Example:
Text data to be processed is extracted from a database, and the identifier categories for segmenting the text are determined. The method uses comma, period, question mark, colon, exclamation mark and semicolon identifiers, plus sentence-beginning and sentence-ending identifiers.
The data is cleaned to remove special characters (such as punctuation marks, spaces and line-feed characters) from the text, and the cleaned text data is then labeled. Each character in a single corpus occupies one line, with a space separating the character from its label. The beginning of the text is marked BEG. The character in front of a punctuation mark is labeled B plus the punctuation category: for a comma ("comma" in English), the first three letters are combined with B to form B_com; similarly, the character after the punctuation mark is combined with E to form E_com. If the text ends with a question it is marked QUE, exclamations use EXC, and periods use EEG. The labeling procedure for the sentence "你好，吃饭了吗" ("Hello, have you eaten?") is shown in FIG. 1. The result is saved in txt file format as shown in FIG. 2, and the data is split into training and test sets in an 8:2 ratio: 80% as the training data set and 20% as the test data set.
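The labeling procedure above can be sketched in a few lines of Python. This is a hedged illustration, not the patent's code: the helper name `label_text`, the use of "O" for unlabeled characters, and the exact punctuation-to-suffix table are assumptions.

```python
# Sketch of the character-level labeling scheme described above.
# Assumptions (not the patent's code): the helper name `label_text`, the use
# of "O" for unlabeled characters, and the punctuation-to-suffix table below.
PUNCT_TAGS = {",": "com", "?": "que", "!": "exc"}  # comma / question / exclamation

def label_text(text):
    """Return (char, label) pairs; punctuation marks are consumed into labels."""
    chars = list(text)
    pairs = []
    first = True
    for i, c in enumerate(chars):
        if c in PUNCT_TAGS:          # the mark itself does not appear in output
            continue
        label = "O"
        if first:
            label, first = "BEG", False
        if i + 1 < len(chars) and chars[i + 1] in PUNCT_TAGS:
            label = "B_" + PUNCT_TAGS[chars[i + 1]]   # char before the mark
        elif i > 0 and chars[i - 1] in PUNCT_TAGS:
            label = "E_" + PUNCT_TAGS[chars[i - 1]]   # char after the mark
        pairs.append((c, label))
    return pairs

# One character per line, separated from its label by a space:
corpus_lines = ["%s %s" % (c, t) for c, t in label_text("你好,吃饭了吗?")]
```

Writing `corpus_lines` out, one list element per line, gives the txt corpus format the text describes.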
The labeled text is fed into the BERT Chinese pre-trained language model, which converts the Chinese characters into Embedding-space word vectors, as shown in FIG. 3.
The BERT embedding encoding vector is the element-wise sum of three embedded features: the word (token) vector, the sentence (segment) vector and the position vector. It serves as the input vector for the BiLSTM.
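The summing of the three embedding features can be shown with a toy example. The 4-dimensional numbers below are invented purely to show the arithmetic; real BERT-base uses learned 768-dimensional vectors.

```python
# Toy illustration of the BERT input embedding: the element-wise sum of the
# token (word) vector, segment (sentence) vector and position vector.
# Real BERT-base uses learned 768-dimensional vectors; these 4-d numbers
# are invented purely to show the arithmetic.
def bert_input_embedding(token_vec, segment_vec, position_vec):
    assert len(token_vec) == len(segment_vec) == len(position_vec)
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]

emb = bert_input_embedding([0.1, 0.2, 0.3, 0.4],   # token embedding
                           [0.0, 0.0, 0.0, 1.0],   # segment embedding
                           [0.5, 0.4, 0.3, 0.2])   # position embedding
# emb is approximately [0.6, 0.6, 0.6, 1.6]
```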
The extracted Embedding-space vectors are fed into a bidirectional long short-term memory network to obtain a vector h containing context information, where h is the concatenation of the preceding-context vector h-before and the following-context vector h-after. After a dropout layer, h is mapped to a vector whose dimension equals the number of labels. The network structure of the BiLSTM is shown in FIG. 4:
The generated result is passed through a CRF layer to output the highest-scoring label sequence, and the above steps are trained and iterated until the model converges.
To optimize the model's effectiveness, the model is trained with different parameter settings (such as learning rate, batch size, number of LSTM units, and the like); each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
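Model selection by highest F1 might look like the following sketch. The patent does not specify its averaging scheme, so this version scores only non-"O" (punctuation) labels, micro-averaged over positions, and is purely illustrative.

```python
# Illustrative F1 computation for model selection. The patent does not give
# its averaging scheme, so this sketch scores only non-"O" (punctuation)
# labels, micro-averaged over positions.
def f1_score(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and p != g)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical selection loop (names `models`, `predict` are assumptions):
# best = max(models, key=lambda m: f1_score(gold_labels, m.predict(test_set)))
```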
Test comparison: data from a real scenario (user messages and comment data) is used as the model's training and test data. Based on the distribution of real sample lengths and the performance of the machine's GPU (Tesla P100, 12 GB of video memory), the maximum input sequence length max_seq_length is set to 192. The total data volume is 200,000 samples; the Transformer has 12 layers; the feature-layer dimensionality is 768; the optimizer is Adam; dropout is set to 0.5 and clip to 0.5. The remaining parameter settings, iteration counts and F1 evaluation results are as follows:
the parameter adjustment mainly aims at the learning rate and the batch _ size, the learning rate directly influences the convergence state of the model, and the batch _ size influences the generalization ability of the model, and the parameters influence the performance of the model most importantly. The first training learning rate was set to 10-5, the number of batches was 36 per batch, the LSTM unit size was 64, the number of iterations and the F1 values were as follows:
The parameters are adjusted by increasing batch_size, with everything else unchanged:
The learning rate is increased, batch_size is set to 36, and the other parameters are unchanged:
batch_size is changed while the learning rate remains 1e-4:
Repeated tests show that a larger learning rate and a larger batch size make the model more stable and give it stronger generalization ability.
Prediction: the sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is aligned with each position of the input sequence, and separators are added at the corresponding positions according to the punctuation label types, completing automatic clause division of the input sequence.
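Viterbi decoding over the CRF's transition scores and the BiLSTM's emission scores can be sketched as follows. The 2-tag score values in the usage example are invented for illustration; real decoding runs over the full tag set.

```python
# Viterbi decoding over CRF scores, as in the prediction step.
# A[i][j] is the transition score from tag i to tag j; P[t][j] is the
# emission score of tag j at position t.
def viterbi(P, A):
    n, k = len(P), len(P[0])
    score = [P[0][:]]                      # best score ending in tag j at step 0
    back = []                              # backpointers
    for t in range(1, n):
        row, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[-1][i] + A[i][j])
            row.append(score[-1][best_i] + A[best_i][j] + P[t][j])
            ptr.append(best_i)
        score.append(row)
        back.append(ptr)
    j = max(range(k), key=lambda j: score[-1][j])
    path = [j]
    for ptr in reversed(back):             # trace the best path backwards
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# With strong anti-switching transitions, the weak middle emission is overruled:
best_path = viterbi([[1, 0], [0, 0.5], [1, 0]], [[0, -10], [-10, 0]])
# best_path == [0, 0, 0]
```

The decoded tag indices are then aligned with the input characters, and a separator is inserted wherever the tag denotes a punctuation category.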
The network of the invention consists of three parts: a BERT encoding network, which encodes characters into an Embedding tensor; a BiLSTM network, through which the tensor passes to produce vectors containing context information; and a CRF layer, through which the vectors pass to output the highest-scoring label sequence.
The optimization objective function of the BiLSTM + CRF structure is:

$$\text{loss} = \log \sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big) - s(X, y) \tag{1}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the input sequence and $y = (y_1, y_2, \ldots, y_n)$ is the label sequence corresponding to the input sequence. The score of assigning labels $y$ to $X$ is defined by formula (2):

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{2}$$

where $A_{i,j}$ is the transition score matrix, giving the score of transitioning from label $i$ to label $j$; $y_0$ and $y_{n+1}$ are the start and end labels of the sentence; $P$ is the score matrix output by the BiLSTM; and $P_{i, y_i}$ is the score of the $i$-th word in the input sequence taking label $y_i$.

Given an input sequence $X$, the probability of a label sequence $y$ is:

$$P(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big)} \tag{6}$$

where $Y_X$ is the set of all possible label sequences for the input sequence $X$; each candidate label sequence for a sentence has a score and a probability. The goal is to maximize the probability that the input sequence $X$ corresponds to the true labels $y$, i.e. to minimize the loss function. Taking the negative logarithm of both sides of formula (6) yields formula (1).
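The sequence score and probability above can be checked with a brute-force sketch that enumerates all tag sequences. This is feasible only at toy sizes (real training computes the partition sum with the forward algorithm), and the start/end tags $y_0$, $y_{n+1}$ are omitted for brevity.

```python
import math
from itertools import product

# Brute-force sketch of the CRF scoring: s(X, y) sums transition scores
# A[y_i][y_{i+1}] and emission scores P[i][y_i]; the probability of y
# normalises exp(s) over every possible tag sequence. Start/end tags are
# omitted for brevity; enumeration is only feasible at toy sizes.
def sequence_score(P, A, y):
    s = sum(P[i][y[i]] for i in range(len(y)))
    s += sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return s

def sequence_prob(P, A, y):
    n, k = len(P), len(P[0])
    Z = sum(math.exp(sequence_score(P, A, yt))
            for yt in product(range(k), repeat=n))
    return math.exp(sequence_score(P, A, y)) / Z
```

With all scores zero, every one of the $k^n$ tag sequences is equally likely, and the probabilities always sum to 1, matching the normalisation in formula (6).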
Claims (8)
1. A punctuation-free text clause dividing method, characterized by comprising the following steps:
S1: preprocess the Chinese text data: extract the text data to be processed from a database, determine the identifier categories used to segment the text, clean the data, and remove special characters from the text;
S2: label the processed data: add a label after each character; each character in a single corpus occupies one line, with a space separating the character from its label;
S3: feed the labeled text into a BERT Chinese pre-trained language model, converting the Chinese characters into Embedding-space word vectors;
S4: feed the extracted Embedding-space vectors into a BiLSTM to obtain a vector h containing context information, and, after a dropout layer, map h to a vector whose dimension equals the number of labels;
S5: pass the result generated in step S4 through a CRF layer to output the highest-scoring label sequence, and iterate the above training steps until the model converges.
3. The punctuation-free text clause dividing method according to claim 1, characterized in that: in step S5, to optimize the model's effectiveness, different parameters are set for model training, each trained model is tested on the test set, and the model with the highest F1 value is selected as the final model.
4. The punctuation-free text clause dividing method according to claim 1, characterized in that: the special characters in the text described in step S1 include punctuation marks, spaces, line feeds, and the like.
5. The punctuation-free text clause dividing method according to claim 1, characterized in that: the Embedding-space word vector in step S3 is the element-wise sum of three embedded features: the word (token) vector, the sentence (segment) vector and the position vector.
6. The punctuation-free text clause dividing method according to claim 1, characterized in that: the vector h in step S4 is the concatenation of the preceding-context vector h-before and the following-context vector h-after.
7. The punctuation-free text clause dividing method according to claim 3, characterized in that: the parameters mainly include the learning rate and batch_size.
8. The punctuation-free text clause dividing method according to any one of claims 1 to 7, characterized by further comprising step S6: prediction. The sequence to be predicted is input into the model and decoded with the Viterbi algorithm to obtain a label sequence; the label sequence is aligned with each position of the input sequence, and separators are added at the corresponding positions according to the punctuation label types, completing automatic clause division of the input sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110220106.2A CN112765957A (en) | 2021-02-27 | 2021-02-27 | Punctuation-free text clause dividing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112765957A true CN112765957A (en) | 2021-05-07 |
Family
ID=75704294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110220106.2A Pending CN112765957A (en) | 2021-02-27 | 2021-02-27 | Punctuation-free text clause dividing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765957A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN108932226A * | 2018-05-29 | 2018-12-04 | 华东师范大学 | A method for adding punctuation marks to punctuation-free text
- CN110413785A * | 2019-07-25 | 2019-11-05 | 淮阴工学院 | An automatic document classification method based on BERT and feature fusion
- CN111241837A * | 2020-01-04 | 2020-06-05 | 大连理工大学 | A named-entity recognition method for theft-case legal documents based on adversarial transfer learning
- CN111339750A * | 2020-02-24 | 2020-06-26 | 网经科技(苏州)有限公司 | A spoken-text processing method that removes stop words and predicts sentence boundaries
- CN111680511A * | 2020-04-21 | 2020-09-18 | 华东师范大学 | A military-domain named-entity recognition method with multiple cooperating neural networks
- CN112287640A * | 2020-11-02 | 2021-01-29 | 杭州师范大学 | A sequence labeling method based on Chinese character structure
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | Application publication date: 20210507
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication |