CN109325114A - A text classification algorithm fusing statistical features and an Attention mechanism - Google Patents

A text classification algorithm fusing statistical features and an Attention mechanism Download PDF

Info

Publication number
CN109325114A
CN109325114A CN201810817616.6A
Authority
CN
China
Prior art keywords
event
word
text
attention
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810817616.6A
Other languages
Chinese (zh)
Inventor
程艳芬
李超
陈逸灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201810817616.6A priority Critical patent/CN109325114A/en
Publication of CN109325114A publication Critical patent/CN109325114A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a text classification algorithm that fuses statistical features with an Attention mechanism. Attention mechanisms are increasingly being applied in the field of natural language processing, but existing methods incur a substantial computational cost when computing Attention weights. The present invention proposes computing Attention weights at the level of structured events. On the one hand, an event carries richer semantics than a single word or phrase; on the other hand, event-based Attention reduces computational complexity. In addition, statistical features are incorporated into the Attention weight computation. Compared with existing models, the semantic information contained in the event structure and the corresponding statistical features improve the quality of the text-vector representation and yield better classification performance. Evaluated on classification accuracy, the experimental results show that the model achieves better results while reducing training time.

Description

A text classification algorithm fusing statistical features and an Attention mechanism
Technical field
The present invention relates to a novel text classification algorithm aimed particularly at large-scale text data sets, improving classification accuracy while reducing the time complexity of the computation.
Background art
With the rapid development of networks and information technology, data grows at an exponential rate, and text is the main form in which information is expressed on the Internet. How to extract key, effective information from large and heterogeneous text data is a current research focus in the field of data mining. Text classification, a key technology in this field, provides an initial processing and categorization of textual information.
The main tasks of text classification are text representation, feature extraction, classification, and effect evaluation. Before a text can be computed on and processed by a computer, the raw input must first be represented using an appropriate feature extraction algorithm; only then can a classification algorithm be trained on the extracted features, and the resulting model used to classify unseen texts. Traditional text feature extraction methods are mainly probability-based models that extract keywords by computing statistical features of the text; they largely ignore the deeper syntactic and semantic information of the text, which inevitably reduces classification accuracy.
Summary of the invention
In view of the above deficiencies, the present invention proposes a text classification algorithm that computes Attention weights at the level of structured events. On the one hand, an event carries richer semantics than a word or phrase; on the other hand, event-based Attention reduces the time complexity of the computation. Furthermore, to address the inability of existing deep learning models to learn the statistical features of text, statistical features are incorporated into the Attention weight computation. Compared with existing models, the semantic information contained in the event structure and the corresponding statistical features improve the quality of the text-vector representation and achieve better classification performance.
Building on existing text classification models, the present invention proposes an event-based Attention mechanism for text classification. The main differences from existing models are as follows:
(1) Existing Attention mechanisms operate mainly at the word level; the event-based Attention mechanism proposed by the present invention computes weights at the level of event structures.
(2) End-to-end deep learning models cannot learn statistical features, which have a measurable influence on classification results; adding statistical features to the model yields a text representation vector that carries more information.
The present invention adopts the following technical scheme:
A text classification algorithm fusing statistical features and an Attention mechanism, characterized by comprising:
Step 1: given a document set, first perform word segmentation, part-of-speech tagging, and stop-word removal; record the term-frequency information of each word and replace synonyms in the documents; then train word vectors for each word with the word2vec tool, compute each word's tf-idf value, and weight the tf-idf value according to the word's part of speech to obtain the word's statistical feature value;
Step 2: extract the events in each document, and compute the statistical feature value of each event and the event-based Attention weight;
Step 3: fuse the event Attention weights with the event statistical feature values to obtain the final vector representation;
Step 4: train the model, and use the trained model on test texts to obtain classification results.
In the above text classification algorithm fusing statistical features and an Attention mechanism, step 1 specifically includes the following steps:
Step 1: apply the Chinese word segmentation tool NLPIR to the document set for word segmentation and part-of-speech tagging, then remove stop words from the documents using a Chinese stop-word list;
Step 2: use the extended edition of HIT's Tongyici Cilin ("Chinese Thesaurus") as the semantic dictionary, replace all near-synonyms in the documents with their representative words, and obtain the final text input sequence;
Step 3: use the word2vec tool to train a word vector for each word in the text input sequence;
Step 4: for each trained word vector, compute its tf-idf value, and compute the word's statistical feature value from the word's part of speech and tf-idf value as: Wi = posw*posi + tfidfw*tfidfi, where posi denotes the part-of-speech value of the word, and the weights take the values posw = 0.5, tfidfw = 0.8.
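A minimal sketch of the word-level statistical feature above. The weights posw = 0.5 and tfidfw = 0.8 come from the text; the per-part-of-speech values in `POS_VALUE` are illustrative assumptions, since the patent does not list them here:

```python
# Statistical feature of a word: W_i = pos_w * pos_i + tfidf_w * tfidf_i.
POS_W, TFIDF_W = 0.5, 0.8
# Hypothetical pos_i table (the patent's actual values are not reproduced here).
POS_VALUE = {"n": 1.0, "v": 0.8, "a": 0.6, "other": 0.2}

def word_stat_feature(pos_tag: str, tfidf: float) -> float:
    """W_i = pos_w * pos_i + tfidf_w * tfidf_i."""
    pos_i = POS_VALUE.get(pos_tag, POS_VALUE["other"])
    return POS_W * pos_i + TFIDF_W * tfidf

print(word_stat_feature("n", 0.25))  # 0.5*1.0 + 0.8*0.25 = 0.7
```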
In the above text classification algorithm fusing statistical features and an Attention mechanism, step 2 specifically includes the following steps:
Step 1: for a given document, perform dependency analysis on each sentence using the Stanford dependency parser to obtain the dependency structure of each sentence; then extract events using the two dependency relations nsubj and dobj: if an nsubj relation and a dobj relation share the same predicate, they are merged into one event, represented by the triple <subj, verb, obj>; relations in the parse result that cannot be merged are kept as two-element events;
Step 2: from the extracted events, obtain the vector representation of each event: e = [xsubj; xverb; xobj], where xsubj, xverb, xobj denote the vector representations of the subject, predicate, and object of the event; then compute the influence weights of the events e1, e2, e3, ……, et in the text on the document as a whole, which highlights the effect of critical events and reduces the influence of non-critical events on the overall semantics of the document; the semantic coding of the attention distribution probability is computed as follows:
aki = exp(eki) / Σj=1..T exp(ekj), eki = v^T · tanh(W·hk + U·hi + b),
where aki denotes the attention weight of node i relative to the overall input, eki denotes the score of the i-th event of the input sequence, T is the number of event elements in the input sequence, hk is the hidden-layer state corresponding to the overall input X`; hi denotes the hidden-layer state corresponding to the i-th event element of the input sequence; v, W, U are weight matrices, b is a bias parameter, and the tanh function serves as the activation function;
Step 3: for each event in the event set of the text, compute its statistical feature value: Ti = Tsubj + Tverb + Tobj, where Tsubj, Tobj, Tverb denote the statistical feature values of the subject, object, and predicate of the event; if the event has no subject or no object, the corresponding value is 0.
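The nsubj/dobj merging rule above can be sketched over a hypothetical list of (relation, head, dependent) arcs, rather than an actual Stanford parser call; the arc data below is invented for illustration:

```python
def extract_events(arcs):
    """arcs: list of (relation, head_word, dependent_word) dependency arcs.
    An nsubj arc and a dobj arc sharing a predicate merge into <subj, verb, obj>;
    unmatched arcs remain two-element events."""
    subj = {h: d for rel, h, d in arcs if rel == "nsubj"}
    obj = {h: d for rel, h, d in arcs if rel == "dobj"}
    events = []
    for verb in subj.keys() | obj.keys():
        if verb in subj and verb in obj:
            events.append((subj[verb], verb, obj[verb]))  # triple event
        elif verb in subj:
            events.append((subj[verb], verb))             # binary event
        else:
            events.append((verb, obj[verb]))              # binary event
    return events

arcs = [("nsubj", "improves", "attention"), ("dobj", "improves", "accuracy"),
        ("nsubj", "converges", "model")]
print(extract_events(arcs))
```

In a real pipeline the arcs would come from the parser's output for one sentence; the rule itself is unchanged.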
In the above text classification algorithm fusing statistical features and an Attention mechanism, step 3 specifically includes the following steps:
Step 1: fuse the statistical feature and the Attention weight: Aki = Tw*Ti + Aw*aki, where aki is the event's Attention weight, Ti is the event's statistical feature value, and Tw and Aw are the weights assigned to the event's statistical feature and Attention weight respectively, with Tw = 1 and Aw = 2.5;
Step 2: obtain the semantic coding C as the accumulated product of event criticality and hidden-layer states: C = Σi=1..T Aki·hi, where Aki is the event weight obtained in step 1 above, hi is the hidden-layer state value of the bidirectional long short-term memory network, and T is the number of events contained in the document;
Step 3: take the obtained semantic coding C, the hidden-layer state value hk of the bidirectional LSTM, and the averaged text input X` as the input of the bidirectional LSTM module; Hk` = H(C, hk, X`) is the final vector representation of the document.
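A numpy sketch of steps 1–2 above: fusing the statistical feature value with the Attention weight (Tw = 1, Aw = 2.5 from the text) and accumulating the weighted hidden states into the semantic coding C. The hidden states and weights below are random stand-ins for real BiLSTM outputs:

```python
import numpy as np

TW, AW = 1.0, 2.5  # fusion weights Tw, Aw from the text

def semantic_coding(a, t, h):
    """a: (T,) event attention weights; t: (T,) event statistical features;
    h: (T, d) hidden states. Returns C = sum_i (Tw*t_i + Aw*a_i) * h_i."""
    fused = TW * t + AW * a   # A_ki = Tw*T_i + Aw*a_ki
    return fused @ h          # C = sum_i A_ki * h_i

rng = np.random.default_rng(0)
a = np.array([0.2, 0.5, 0.3])          # stand-in attention weights
t = np.array([1.0, 0.4, 0.8])          # stand-in statistical features
h = rng.standard_normal((3, 4))        # stand-in BiLSTM hidden states
C = semantic_coding(a, t, h)
print(C.shape)  # (4,)
```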
In the above text classification algorithm fusing statistical features and an Attention mechanism, step 4 specifically includes the following steps:
Step 1: feed the final text representation vector into a softmax classifier and train the model;
Step 2: test the trained model on test texts to obtain the final classification results.
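The softmax classification of step 1 can be sketched as a linear layer followed by a softmax; the class count, vector dimension, and random weights below are illustrative stand-ins for a trained model:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(text_vec, W, b):
    """Linear layer + softmax over the final text representation vector."""
    return softmax(W @ text_vec + b)

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 8))  # 5 hypothetical classes, 8-dim text vector
b = np.zeros(5)
probs = classify(rng.standard_normal(8), W, b)
print(probs.sum())  # ~1.0 (a probability distribution over classes)
```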
The present invention obtains more semantic features while reducing the influence of useless features on classification results. To evaluate the validity of the algorithm, five groups of comparative experiments were designed and implemented on four data sets, run on a server with 64 GB of memory. Comparing the models' average running time and convergence speed under the same learning rate shows that the present invention effectively reduces training time and greatly accelerates convergence, while classification accuracy also improves correspondingly: synonym replacement raises accuracy by 1.68% on average over no replacement, and fusing statistical features adds an average of 2.22%, showing that statistical features have a measurable influence on classification accuracy; the event-based Attention mechanism adds an average of 3.62%. With the proposed model, accuracy improves by 4.97% on average, and the classification effect is the best.
Description of the drawings
Fig. 1 shows the structure of the Attention network.
Fig. 2 is a comparison diagram of model training convergence.
Specific embodiment
The technical solution of the present invention is further described below with reference to the embodiments and the accompanying drawings.
Embodiment:
For an input passage of text, word segmentation, stop-word removal, and synonym replacement are performed first. Then a word vector is trained for each word with the word2vec tool, the tf-idf value of each word is computed, and weights are assigned according to the word's part of speech and tf-idf value to obtain the word's statistical feature value. The event-based Attention weight is computed, along with the statistical feature weight of each event. The two weights are fused, and the feature vector obtained on this basis contains more semantic information. The specific algorithm logic is as follows:
(1) Segment the document set, perform part-of-speech tagging, and remove stop words.
(2) Record the term-frequency information of each word and replace synonyms in the documents.
(3) Extract the events in each document.
(4) Compute the statistical feature value of each event and the event-based Attention weight.
(5) Fuse the event Attention weights and statistical feature values to obtain the final vector representation.
(6) Train the model and obtain the classification results.
1. Statistical feature values of words.
Synonyms in the text are normalized first. The extended edition of HIT's Tongyici Cilin ("Chinese Thesaurus") is used here as the semantic dictionary. Each word has several codes, and each code consists of a complete five-level code plus one flag bit. A five-level code denotes one atomic cluster, and the flag bit is "=", "#", or "@", where "=" denotes synonymy, "#" denotes similarity (related words), and "@" denotes an independent word. Since only synonyms are to be replaced, only atomic clusters flagged "=" are selected. When preprocessing the text, the first word of each atomic cluster is taken as the representative word of that cluster, and all near-synonyms in the text are replaced with their representative words to obtain the final text input sequence.
The computation of word statistical features relies mainly on statistical theory, using existing data to estimate the influence of each feature on the final classification and thereby screen effective features. Although the introduction of deep learning overcomes the defect of the feature-independence assumption and captures more semantic information, the influence of word statistical features on classification results cannot be ignored. The statistical feature value of a word is computed as follows.
Definition 1. The part-of-speech value posi of word Wi is the importance of the part of speech to which Wi belongs; the value assigned to each part of speech is as follows:
The statistical feature value of the corresponding word is computed as:
Wi = posw*posi + tfidfw*tfidfi
To obtain the statistical feature value of a word, weights are assigned to the part of speech and the tf-idf value respectively, and the weighted feature values are summed to give the total feature value. In the formula, posi denotes the part-of-speech value of word Wi, tfidfi denotes the tf-idf value of word Wi, posw denotes the part-of-speech weight, and tfidfw denotes the tf-idf weight. Experimental tuning gives the weight values posw = 0.5 and tfidfw = 0.8.
2. The event-based Attention mechanism.
For a passage of text, judging its category in units of "events" accords with normal cognitive habits. Given a text, dependency analysis is first performed on each sentence in the document with the Stanford dependency parser to obtain the dependency structure of each sentence. Events are then extracted using the two dependency relations nsubj and dobj: if an nsubj relation and a dobj relation share the same predicate, they are merged into one event, represented by the triple <subj, verb, obj>. Relations in the parse result that cannot be merged are kept as two-element events.
After the candidate events are extracted, each word is replaced by its trained word vector, and the resulting event is represented as a vector with three times the word-vector dimension, computed as: e = [xsubj; xverb; xobj], i.e. the concatenation of the subject, predicate, and object vectors.
The semantic coding of the attention distribution probability is computed as follows:
aki = exp(eki) / Σj=1..T exp(ekj), eki = v^T · tanh(W·hk + U·hi + b)
where aki denotes the attention weight of node i relative to the overall input, T is the number of event elements in the input sequence, and hk is the hidden-layer state corresponding to the overall input X`. hi denotes the hidden-layer state corresponding to the i-th event element of the input sequence; v, W, U are weight matrices and b is a bias parameter. The model structure is shown in Fig. 1.
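A numpy sketch of the additive-attention score in the standard form consistent with the variables described above (v, W, U, b, tanh): e_ki = v^T·tanh(W·hk + U·hi + b), normalized by a softmax into a_ki. All parameters below are random stand-ins for learned weights:

```python
import numpy as np

def attention_weights(hk, H, W, U, v, b):
    """hk: (d,) hidden state of the overall input; H: (T, d) per-event hidden
    states. Returns the normalized attention weights over the T events."""
    scores = np.tanh(H @ U.T + W @ hk + b) @ v  # e_ki = v.T tanh(W hk + U hi + b)
    scores = scores - scores.max()              # stable softmax
    e = np.exp(scores)
    return e / e.sum()                          # a_ki

rng = np.random.default_rng(3)
d, T, m = 4, 3, 5                               # m: attention hidden size (assumed)
W, U = rng.standard_normal((m, d)), rng.standard_normal((m, d))
v, b = rng.standard_normal(m), rng.standard_normal(m)
a = attention_weights(rng.standard_normal(d), rng.standard_normal((T, d)),
                      W, U, v, b)
print(a.sum())  # ~1.0
```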
3. Fusion of feature weights.
In the keyword extraction process, the weight information obtained from traditional statistical features is combined with that obtained from the Attention mechanism, and the resulting semantic coding is used as the input of the BiLSTM. The final text feature vector thus takes the statistical feature values of the text into account while containing more semantic information. The text representation vector obtained by this algorithm better reflects the main information of the text and effectively improves classification accuracy. The main processing logic is shown in Algorithm 1.
The algorithm first computes the sum of the statistical feature values of the words corresponding to each event element, obtaining the event's feature weight; it then computes the event's Attention weight, and the two values are combined with fixed weights to obtain the event's overall weight. The semantic coding C is obtained as the accumulated product of event criticality and the BiLSTM hidden-layer outputs. Finally, the semantic coding C, the overall feature vector of the document, and the document's overall input vector X` are fed into the BiLSTM module; the hidden-layer state value Hk` of the last node is the final feature vector. This feature vector carries the weight information of the historical input nodes and highlights the effect of keywords. A multi-class logistic regression classifier is then built to obtain the classification results.
4. Effect of the invention.
To verify the validity of the model, yelp2013, the Sogou corpus, Amazon Review, and IMDB are chosen as the experimental data sets; for each, 90% is used as the training set and 10% as the test set. The experimental framework is based on the TensorFlow deep learning framework, and five groups of comparative experiments are implemented: BiLSTM_Attention (BA), BiLSTM_Attention with synonym replacement (S_BA), BiLSTM_Attention fused with statistical features (T_BA), event-based BiLSTM_Attention (E_BA), and the model designed by the present invention (Proposed). The optimizer is Adam with a learning rate of 0.01; num_epoch is set to 20, batch_size to 32, and the number of hidden-layer nodes is 256. To speed up training, the model uses a single-layer network, and the classifier is a multi-class logistic regression classifier. The specific experimental procedure is: after preprocessing the texts to be trained, map them to 50-dimensional vectors with the word2vec tool; the feature extraction part is implemented with the corresponding model among the five above, and the input of the classifier is the last hidden-layer state value of the corresponding model.
The experiments are run on a server with 64 GB of memory. Comparing the models' average running time shows that the event-based Attention mechanism effectively reduces training time while greatly accelerating convergence. The training times of three of the models on the four data sets are shown in Table 1. The training results of two of the models on the yelp2013 data set are shown in Fig. 2, from which it can be seen that the event-based Attention mechanism converges faster than BA while achieving higher accuracy.
Table 1. Model training time
For the above five groups of experiments, each group is tuned over repeated training runs and the best result is selected; the resulting statistics are shown in Table 2. Comparing the results of the five groups shows that the designed model effectively improves the accuracy of text classification. Synonym replacement raises accuracy by 1.68% on average over no replacement, and fusing statistical features adds an average of 2.22%, showing that statistical features have a measurable influence on classification accuracy; the event-based Attention mechanism adds an average of 3.62%. With the proposed model, accuracy improves by 4.97% on average, and the classification effect is the best.
Table 2. Accuracy of the five models on the four data sets
The specific embodiments described herein are merely illustrative of the spirit of the present invention. Those skilled in the art to which the present invention belongs can make various modifications, additions, or similar substitutions to the described embodiments without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims (5)

1. A text classification algorithm fusing statistical features and an Attention mechanism, characterized by comprising:
Step 1: given a document set, first perform word segmentation, part-of-speech tagging, and stop-word removal; record the term-frequency information of each word and replace synonyms in the documents; then train word vectors for each word with the word2vec tool, compute each word's tf-idf value, and assign weights according to the word's part of speech and tf-idf value to obtain the word's statistical feature value;
Step 2: extract the events in each document, and compute the statistical feature value of each event and the event-based Attention weight;
Step 3: fuse the event Attention weights with the event statistical feature values to obtain the final vector representation;
Step 4: train the model, and use the trained model on test texts to obtain classification results.
2. The text classification algorithm fusing statistical features and an Attention mechanism according to claim 1, characterized in that step 1 specifically includes the following steps:
Step 1: apply the Chinese word segmentation tool NLPIR to the document set for word segmentation and part-of-speech tagging, then remove stop words from the documents using a Chinese stop-word list;
Step 2: use the extended edition of HIT's Tongyici Cilin ("Chinese Thesaurus") as the semantic dictionary, replace all near-synonyms in the documents with their representative words, and obtain the final text input sequence;
Step 3: use the word2vec tool to train a word vector for each word in the text input sequence;
Step 4: for each trained word vector, compute its tf-idf value, and compute the word's statistical feature value from the word's part of speech and tf-idf value as: Wi = posw*posi + tfidfw*tfidfi, where posi denotes the part-of-speech value of the word, and the weights take the values posw = 0.5, tfidfw = 0.8.
3. The text classification algorithm fusing statistical features and an Attention mechanism according to claim 1, characterized in that step 2 specifically includes the following steps:
Step 1: for a given document, perform dependency analysis on each sentence using the Stanford dependency parser to obtain the dependency structure of each sentence; then extract events using the two dependency relations nsubj and dobj: if an nsubj relation and a dobj relation share the same predicate, they are merged into one event, represented by the triple <subj, verb, obj>; relations in the parse result that cannot be merged are kept as two-element events;
Step 2: from the extracted events, obtain the vector representation of each event: e = [xsubj; xverb; xobj], where xsubj, xverb, xobj denote the vector representations of the subject, predicate, and object of the event; then compute the influence weights of the events e1, e2, e3, ……, et in the text on the document as a whole, which highlights the effect of critical events and reduces the influence of non-critical events on the overall semantics of the document; the semantic coding of the attention distribution probability is computed as follows:
aki = exp(eki) / Σj=1..T exp(ekj), eki = v^T · tanh(W·hk + U·hi + b),
where aki denotes the attention weight of node i relative to the overall input, eki denotes the score of the i-th event of the input sequence, T is the number of event elements in the input sequence, hk is the hidden-layer state corresponding to the overall input X`; hi denotes the hidden-layer state corresponding to the i-th event element of the input sequence; v, W, U are weight matrices, b is a bias parameter, and the tanh function serves as the activation function;
Step 3: for each event in the event set of the text, compute its statistical feature value: Ti = Tsubj + Tverb + Tobj, where Tsubj, Tobj, Tverb denote the statistical feature values of the subject, object, and predicate of the event; if the event has no subject or no object, the corresponding value is 0.
4. The text classification algorithm fusing statistical features and an Attention mechanism according to claim 1, characterized in that step 3 specifically includes the following steps:
Step 1: fuse the statistical feature and the Attention weight: Aki = Tw*Ti + Aw*aki, where aki is the event's Attention weight, Ti is the event's statistical feature value, and Tw and Aw are the weights assigned to the event's statistical feature and Attention weight respectively, with Tw = 1 and Aw = 2.5;
Step 2: obtain the semantic coding C as the accumulated product of event criticality and hidden-layer states: C = Σi=1..T Aki·hi, where Aki is the event weight obtained in step 1 above, hi is the hidden-layer state value of the bidirectional long short-term memory network, and T is the number of events contained in the document;
Step 3: take the obtained semantic coding C, the hidden-layer state value hk of the bidirectional LSTM, and the averaged text input X` as the input of the bidirectional LSTM module; Hk` = H(C, hk, X`) is the final vector representation of the document.
5. The text classification algorithm fusing statistical features and an Attention mechanism according to claim 1, characterized in that step 4 specifically includes the following steps:
Step 1: feed the final text representation vector into a softmax classifier and train the model;
Step 2: test the trained model on test texts to obtain the final classification results.
CN201810817616.6A 2018-07-24 2018-07-24 A text classification algorithm fusing statistical features and an Attention mechanism Pending CN109325114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810817616.6A CN109325114A (en) 2018-07-24 2018-07-24 A text classification algorithm fusing statistical features and an Attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810817616.6A CN109325114A (en) 2018-07-24 2018-07-24 A text classification algorithm fusing statistical features and an Attention mechanism

Publications (1)

Publication Number Publication Date
CN109325114A true CN109325114A (en) 2019-02-12

Family

ID=65263959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810817616.6A Pending CN109325114A (en) 2018-07-24 2018-07-24 A text classification algorithm fusing statistical features and an Attention mechanism

Country Status (1)

Country Link
CN (1) CN109325114A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992774A (en) * 2019-03-25 2019-07-09 北京理工大学 The key phrase recognition methods of word-based attribute attention mechanism
CN110209824A (en) * 2019-06-13 2019-09-06 中国科学院自动化研究所 Text emotion analysis method based on built-up pattern, system, device
CN110309306A (en) * 2019-06-19 2019-10-08 淮阴工学院 A kind of Document Modeling classification method based on WSD level memory network
CN110309317A (en) * 2019-05-22 2019-10-08 中国传媒大学 Term vector generation method, system, electronic device and the medium of Chinese corpus
CN110781303A (en) * 2019-10-28 2020-02-11 佰聆数据股份有限公司 Short text classification method and system
CN111061881A (en) * 2019-12-27 2020-04-24 浪潮通用软件有限公司 Text classification method, equipment and storage medium
CN111159409A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Text classification method, device, equipment and medium based on artificial intelligence
CN112612898A (en) * 2021-03-05 2021-04-06 蚂蚁智信(杭州)信息技术有限公司 Text classification method and device
CN113407721A (en) * 2021-06-29 2021-09-17 哈尔滨工业大学(深圳) Method, device and computer storage medium for detecting log sequence abnormity

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013065257A (en) * 2011-09-20 2013-04-11 Fuji Xerox Co Ltd Information processing unit and program
JP2016024545A (en) * 2014-07-17 2016-02-08 株式会社Nttドコモ Information management apparatus, information management system, and information management method
CN107045524A (en) * 2016-12-30 2017-08-15 中央民族大学 A kind of method and system of network text public sentiment classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013065257A (en) * 2011-09-20 2013-04-11 Fuji Xerox Co Ltd Information processing unit and program
JP2016024545A (en) * 2014-07-17 2016-02-08 株式会社Nttドコモ Information management apparatus, information management system, and information management method
CN107045524A (en) * 2016-12-30 2017-08-15 中央民族大学 A kind of method and system of network text public sentiment classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHAO LI et al.: "A Novel Document Classification Algorithm Based on Statistical Features and Attention Mechanism", 2018 International Joint Conference on Neural Networks (IJCNN) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992774A (en) * 2019-03-25 2019-07-09 Beijing Institute of Technology Key phrase recognition method based on a word-attribute attention mechanism
CN110309317A (en) * 2019-05-22 2019-10-08 Communication University of China Word vector generation method, system, electronic device and medium for a Chinese corpus
CN110309317B (en) * 2019-05-22 2021-07-23 Communication University of China Method, system, electronic device and medium for generating word vectors of a Chinese corpus
CN110209824A (en) * 2019-06-13 2019-09-06 Institute of Automation, Chinese Academy of Sciences Text sentiment analysis method, system and device based on a combined model
CN110209824B (en) * 2019-06-13 2021-06-22 Institute of Automation, Chinese Academy of Sciences Text sentiment analysis method, system and device based on a combined model
CN110309306A (en) * 2019-06-19 2019-10-08 Huaiyin Institute of Technology Document modeling and classification method based on a WSD hierarchical memory network
CN110781303A (en) * 2019-10-28 2020-02-11 Bailing Data Co., Ltd. Short text classification method and system
CN111061881A (en) * 2019-12-27 2020-04-24 Inspur General Software Co., Ltd. Text classification method, device and storage medium
CN111159409A (en) * 2019-12-31 2020-05-15 Tencent Technology (Shenzhen) Co., Ltd. Text classification method, apparatus, device and medium based on artificial intelligence
CN111159409B (en) * 2019-12-31 2023-06-02 Tencent Technology (Shenzhen) Co., Ltd. Text classification method, apparatus, device and medium based on artificial intelligence
CN112612898A (en) * 2021-03-05 2021-04-06 Ant Zhixin (Hangzhou) Information Technology Co., Ltd. Text classification method and device
CN113407721A (en) * 2021-06-29 2021-09-17 Harbin Institute of Technology (Shenzhen) Method, device and computer storage medium for detecting log sequence anomalies

Similar Documents

Publication Publication Date Title
CN109325114A (en) Text classification algorithm fusing statistical features and the attention mechanism
Wang et al. A hybrid document feature extraction method using latent Dirichlet allocation and word2vec
Song et al. Research on text classification based on convolutional neural network
Wang et al. Integrating extractive and abstractive models for long text summarization
US11675981B2 (en) Neural network systems and methods for target identification from text
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
Quan et al. An efficient framework for sentence similarity modeling
CN112001185A (en) Sentiment classification method combining Chinese syntax and a graph convolutional neural network
CN108197111A (en) Automatic text summarization method based on fused semantic clustering
CN108733653A (en) Sentiment analysis method using Skip-gram models fusing part-of-speech and semantic information
CN108763402A (en) Class-center-vector text categorization method based on dependency relations, part of speech and a semantic dictionary
CN111078833B (en) Text classification method based on neural network
CN112001186A (en) Sentiment classification method using a graph convolutional neural network and Chinese syntax
CN104834735A (en) Automatic document summarization method based on word vectors
CN110378409A (en) Chinese-Vietnamese news document summarization method based on an element-association attention mechanism
CN106599032A (en) Text event extraction method combining sparse coding and a structured perceptron
CN109684642A (en) Summary extraction method combining page parsing rules and NLP text vectors
CN110175221A (en) Spam message recognition method using word vectors combined with machine learning
Qiu et al. Advanced sentiment classification of *** microblogs on smart campuses based on multi-feature fusion
CN102779119B (en) Keyword extraction method and device
Errami et al. Sentiment Analysis onMoroccan Dialect based on ML and Social Media Content Detection
Gao et al. Sentiment classification for stock news
Foong et al. Text summarization using latent semantic analysis model in mobile android platform
CN103744838A (en) Chinese emotional summarization system and method for measuring mainstream emotional information
CN114265936A (en) Method for text mining of science and technology projects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20190212)