CN109918667B - Quick incremental classification method for short text data stream based on word2vec model

Info

Publication number
CN109918667B (application CN201910169255.3A)
Authority
CN
China
Prior art keywords: text, word, time, vector, vec
Legal status
Active
Application number
CN201910169255.3A
Other languages
Chinese (zh)
Other versions
CN109918667A
Inventor
李培培
胡阳
胡学钢
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Application filed by Hefei University of Technology
Priority to CN201910169255.3A
Publication of CN109918667A
Application granted
Publication of CN109918667B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed fast incremental classification method for short text data streams based on a word2vec model, which comprises the following steps: 1, acquiring an external corpus from a knowledge base, constructing a word2Vec model and acquiring a word vector set Vec; 2, constructing a dynamic linear LR classifier group and a global class label set; 3, updating the global class label set in an incremental mode, and modifying and extending the classifier group; 4, predicting samples, acquiring the set of sample texts containing new words that are not in the word set Vocab, and constructing a word vector model and a word vector expansion set; and 5, acquiring the class label strength of the test data stream. The method can improve the classification accuracy of short text data streams and reduce the time consumed by model construction, thereby achieving the purpose of rapidly adapting to short text data stream classification.

Description

Quick incremental classification method for short text data stream based on word2vec model
Technical Field
The invention belongs to the field of short text data stream mining, and particularly relates to a method for the fast classification of unbounded short text data streams whose class labels change continuously.
Background
With the rapid development of information technologies such as Web 2.0, massive, high-speed and dynamic data, i.e. data streams, emerge in practical application fields such as social networks, online shopping and sensor networks. In the social field, owing to the popularization of social network media and forums, short texts are rushing into our lives, such as user comments and forum interactions on microblogs and Twitter. These short texts contain a large amount of information from various fields, such as sports, education and science. Compared with ordinary text, short text is sparse, real-time, massive and non-standard. For example, a Sina microblog post is limited to 140 characters, and many posts contain only one sentence or even a single phrase. Short texts are also updated in real time and vary dynamically on the network, and topic evolution occurs, such as changes of the hot-topic ranking list and hot words on a microblog. Moreover, users usually generate a large amount of short text while interacting on network platforms, and the data volume grows rapidly. According to incomplete statistics, the user comments of current mainstream interaction platforms (such as microblog and Facebook) average 346 comments per second. This requires a short text processor to have a commensurate data processing throughput; otherwise, data will accumulate heavily over time. The above problems make the related short text classification methods and data stream classification methods face serious challenges:
Challenge one: conventional short text classification methods typically expand the short text with an external corpus and then classify it with a conventional classifier, such as naive Bayes, a Support Vector Machine (SVM) or a decision tree, or directly expand the short text with its own implicit statistical information before classification, such as LDA + KNN. However, these are essentially batch procedures and have difficulty processing short text data that keeps arriving over time.
Challenge two: a short text data stream generates a large amount of short text data at high speed at every moment, which causes a serious high-dimensionality problem, and existing text classification methods rarely address it.
Challenge three: because of their real-time variability, short text streams often undergo topic evolution, and current data stream classification work pays little attention to this situation.
Challenge four: most existing short text data stream classification work applies batch classifiers (such as SVM, naive Bayes and KNN) to short text data stream classification. These base classifiers are essentially designed for batch processing and obtain an optimal solution by iteration, so the training time cost is high, delays easily accumulate when processing data blocks, and they are difficult to apply to training on high-speed short text data streams and difficult to scale.
Disclosure of Invention
In order to avoid the defects in the prior art, the invention provides a quick incremental classification method of short text data streams based on a word2vec model, so that the classification accuracy of the short text data streams can be improved, the time consumption of model construction is reduced, and the purpose of quickly adapting to the classification of the short text data streams is achieved.
In order to achieve the aim, the invention adopts the following technical scheme:
the invention relates to a fast incremental classification method of short text data streams based on a word2vec model, which is characterized by comprising the following steps:
step 1: constructing a word2vec model:
Step 1.1: obtain an external text corpus from a knowledge base, denoted C_1 = {d_1, d_2, ..., d_m, ..., d_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C_1 and d_m represents the m-th text, with d_m = {w_1^m, ..., w_i^m, ..., w_I^m}, i = 1, 2, ..., I, where I denotes the number of words in the m-th text d_m, w_i^m denotes the i-th word of the m-th text d_m, and w_i^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C_1, and Vocab = {w_1, w_2, ..., w_z, ..., w_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and w_z represents the z-th word in the word set Vocab; the word vector of the z-th word w_z is denoted Vec(w_z), giving the word vector set Vec = {Vec(w_1), Vec(w_2), ..., Vec(w_z), ..., Vec(w_Z)}, which is randomly initialized;
Step 1.2: count the number of occurrences, i.e. the word frequency, of every word of the word set Vocab in the external corpus C_1; delete the words whose frequency is less than or equal to the lower frequency limit F_1 or greater than or equal to the upper frequency limit F_2, obtaining a filtered word frequency set; sort the filtered word frequency set in descending order, and construct a Huffman tree T over the sorted word frequency set;
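The frequency filtering and Huffman coding of step 1.2 can be sketched as follows; this is only an illustrative Python sketch, and the function name, node layout and exact threshold handling are assumptions rather than details fixed by the patent:

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_codes(corpus, f_low, f_high):
    """Filter words by frequency and assign Huffman codes (lists of 0/1 digits)."""
    freq = Counter(w for text in corpus for w in text)
    # Keep only words whose frequency lies strictly between F_1 and F_2 (step 1.2)
    kept = {w: f for w, f in freq.items() if f_low < f < f_high}
    tie = count()  # tie-breaker so heapq never compares the node dicts
    heap = [(f, next(tie), {"word": w}) for w, f in kept.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), {"left": left, "right": right}))
    codes = {}
    def walk(node, code):
        if "word" in node:
            codes[node["word"]] = code
        else:
            walk(node["left"], code + [0])
            walk(node["right"], code + [1])
    if heap:
        walk(heap[0][2], [])
    return codes
```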
Step 1.3: set the window size to window, traverse the m-th text d_m of the external corpus C_1, and obtain the current i-th word w_i^m and its context Context(w_i^m); the conditional probability p(Context(w_i^m) | w_i^m) of predicting the context Context(w_i^m) when the current word w_i^m is known is calculated by equation (1):

p(Context(w_i^m) | w_i^m) = ∏_{u ∈ Context(w_i^m)} p(u | w_i^m)   (1)
In the formula (1), u represents a context
Figure BDA00019873797500000213
In any word, in>
Figure BDA00019873797500000210
Indicating that the current word is known to be pick>
Figure BDA00019873797500000211
Then, the conditional probability of the word u is predicted, and the following are:
Figure BDA00019873797500000212
in the formula (2), H is the height of Huffman tree T, d h The h-th leaf node in the path from the root node to the leaf node corresponding to the word u corresponds to the h-th digit in the Huffman code; theta h A parameter corresponding to the h-th leaf node in a path from the root node to the leaf node corresponding to the word u;
The probability p(d_h | Vec(w_i^m), θ_h) that the word vector Vec(w_i^m) of the i-th word w_i^m takes the h-th digit d_h under the parameter θ_h is calculated using equation (3):

p(d_h | Vec(w_i^m), θ_h) = [σ(Vec(w_i^m)·θ_h)]^(1-d_h) · [1 - σ(Vec(w_i^m)·θ_h)]^(d_h)   (3)

In equation (3), σ(·) denotes the sigmoid activation function;
Step 1.4: calculate the maximum likelihood function ζ of the probability p(Context(w_i^m) | w_i^m) using equation (4):

ζ = ∑_{d_m ∈ C_1} ∑_{w_i^m ∈ d_m} log p(Context(w_i^m) | w_i^m)   (4)
Step 1.5: for maximum likelihood function
Figure BDA0001987379750000038
Parameter θ of h And word vector pick>
Figure BDA0001987379750000039
Respectively obtaining corresponding gradients for updating the parameter theta h And word vector pick>
Figure BDA00019873797500000310
To obtain an updated parameter theta' j Sum word vector>
Figure BDA00019873797500000311
Further obtaining an updated word vector set Vec 'as a word2Vec model and assigning the updated word vector set Vec' to the original word vector set Vec;
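Steps 1.1 to 1.5 amount to training a word2vec model with hierarchical softmax over the Huffman tree T, predicting the context from the current word. The patent trains this model itself; purely as an illustration, a roughly equivalent model can be obtained with the gensim library (gensim is an assumption here, it is not named in the patent), using the feature dimension of 50 mentioned in the embodiment:

```python
from gensim.models import Word2Vec

# corpus_c1: the external corpus C_1 as a list of tokenized texts (illustrative values)
corpus_c1 = [["nba", "odds", "betting"], ["mariah", "carey", "concert"]]

model = Word2Vec(
    sentences=corpus_c1,
    vector_size=50,   # feature dimension used in the embodiment (Table 1)
    window=5,         # context window size "window"
    min_count=1,      # lower frequency limit F_1 (illustrative value)
    sg=1,             # skip-gram: predict the context from the current word
    hs=1,             # hierarchical softmax over the Huffman tree T
    negative=0,
)
vec = {w: model.wv[w] for w in model.wv.index_to_key}   # word vector set Vec
```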
step 2: constructing a dynamic linear LR classifier Model and a global class label set L:
step 2.1: defining a time t, and initializing t =0;
Step 2.2: acquire the data block D_t = {(d_n^t, y_n^t) | n = 1, 2, ..., N} at time t in the data stream, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text of the data block D_t at time t, and y_n^t is the class label of the n-th text of the data block D_t at time t; let L_t denote the class label set of the data block D_t at time t, with L_t = {l_1^t, l_2^t, ..., l_j^t, ..., l_J^t}, where l_j^t is the j-th class label of the class label set L_t at time t, j = 1, 2, ..., J, and J is the total number of different class labels in the data block D_t at time t;
Step 2.3: put the class label set L_t into the global class label set L; then traverse the n-th text d_n^t, obtain the word vector of each of its words, compute their average value, and take the average value as the text vector x_n^t of the n-th text d_n^t at time t; thereby obtain the text vectors of all texts in the data block D_t at time t and replace each text in the data block D_t at time t with its text vector, obtaining the replaced data block D_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} at time t;
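A minimal sketch of the text vectorization of step 2.3, averaging the word vectors of a text (function and variable names are illustrative):

```python
import numpy as np

def text_vector(text, vec, dim=50):
    """Average the word vectors of a text; words without a vector are skipped."""
    vectors = [vec[w] for w in text if w in vec]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# x_n_t = text_vector(["nba", "odds", "betting"], vec)
```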
Step 2.4: class label set L according to time t t The replaced data block D at the time t t Polymerizing to obtain a data block at the t time after polymerization
Figure BDA0001987379750000041
Wherein,
Figure BDA0001987379750000042
for aggregated data block D at time t t Wherein all class tags are->
Figure BDA0001987379750000043
A set of text vectors of;
Step 2.5: create and randomly initialize the j-th parameter vectors NP_j^t and ZP_j^t at time t, thereby obtaining the parameter vector sets NP_t = {NP_1^t, ..., NP_j^t, ..., NP_J^t} and ZP_t = {ZP_1^t, ..., ZP_j^t, ..., ZP_J^t} at time t;
Step 2.6: initializing j =1;
Step 2.7: in the aggregated data block D_t at time t, assign the class label 1 to the text vector set D_j^t whose class label is l_j^t and take it as the positive training samples; assign the class label 0 to the text vector sets whose class label is not l_j^t and take them as the negative training samples; construct the training text vector set Train_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} from the positive and negative training samples, where y_n^t ∈ {0, 1};
Step 2.8: when t >0, perform step 2.10, otherwise perform step 2.9;
Step 2.9: randomly initialize the feature weight w_j^t of the classifier whose class label is l_j^t;
Step 2.10: traversal training textVector set Train t And the jth parameter vector at the t moment is calculated according to the equations (5) and (6)
Figure BDA00019873797500000415
And &>
Figure BDA00019873797500000416
Updated to the jth parameter vector at time t +1>
Figure BDA00019873797500000417
And &>
Figure BDA00019873797500000418
Figure BDA00019873797500000419
Figure BDA00019873797500000420
In the formulae (5) and (6),
Figure BDA00019873797500000421
set ≥ for text vector>
Figure BDA00019873797500000422
Nth text vector at middle t moment
Figure BDA00019873797500000423
Is optimized for the iteration parameter factor of->
Figure BDA00019873797500000424
Set ≥ for text vector>
Figure BDA00019873797500000425
The nth text vector at time instant t->
Figure BDA00019873797500000426
Is based on the loss function>
Figure BDA0001987379750000051
Weight the characteristic->
Figure BDA0001987379750000052
And has:
Figure BDA0001987379750000053
Figure BDA0001987379750000054
Figure BDA0001987379750000055
in the formula (8), the reaction mixture is,
Figure BDA0001987379750000056
for the text vector at time t->
Figure BDA0001987379750000057
Is predicted probability of, and
Figure BDA0001987379750000058
Step 2.11: assign j+1 to j and return to step 2.8 until j is greater than J, thereby obtaining the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 2.12: traversing t +1 moment parameter vector set NP t+1 And ZP t+1 And initializing j =1;
step 2.13: calculating the jth characteristic weight at the t +1 moment by using the formula (10)
Figure BDA0001987379750000059
Figure BDA00019873797500000510
In formula (10), λ 1 、λ 2 Alpha and beta are both hyper-parameters, zgn (·) is an upward rounding function;
Step 2.14: use the j-th feature weight w_j^{t+1} at time t+1 to construct a linear LR classifier M_j^{t+1}, and add the j-th classifier M_j^{t+1} at time t+1 to the dynamic linear LR classifier group Model_{t+1} at time t+1;
Step 2.15: assign j+1 to j and return to step 2.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, M_2^{t+1}, ..., M_J^{t+1}} at time t+1;
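Steps 2.10 to 2.14 follow the FTRL style of per-coordinate updates. The sketch below assumes that equations (5) to (10) take the standard FTRL-Proximal form with logistic loss; all names and the default hyper-parameter values are illustrative:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def ftrl_update(np_j, zp_j, w_j, x, y, alpha=0.01):
    """One per-coordinate FTRL step for a (text vector x, 0/1 label y) pair:
    NP accumulates squared gradients, ZP the adjusted gradient sum."""
    p = sigmoid(np.dot(w_j, x))                                # equation (9)
    g = (p - y) * x                                            # equation (8)
    sigma = (np.sqrt(np_j + g * g) - np.sqrt(np_j)) / alpha    # equation (7)
    zp_j = zp_j + g - sigma * w_j                              # equation (5)
    np_j = np_j + g * g                                        # equation (6)
    return np_j, zp_j

def feature_weight(np_j, zp_j, alpha=0.01, beta=0.01, lam1=0.01, lam2=0.01):
    """Equation (10): sparse FTRL closed-form weight."""
    w = np.zeros_like(zp_j)
    active = np.abs(zp_j) > lam1
    w[active] = -(zp_j[active] - np.sign(zp_j[active]) * lam1) / (
        (beta + np.sqrt(np_j[active])) / alpha + lam2)
    return w
```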
And step 3: incrementally updating the global label set L, modifying and adding the classifier Model at the t +1 moment t+1
Step 3.1: initializing t =1;
Step 3.2: obtain the replaced data block D_t at time t according to the processing method of steps 2.2 to 2.3;
Step 3.3: traverse the text vector x_n^t of the n-th text d_n^t of the data block D_t at time t, and obtain the probability p_{n,j}^t that the text vector x_n^t has the j-th class label l_j^t using equation (11), thereby obtaining the class label probability set Pro_n^t = {p_{n,1}^t, p_{n,2}^t, ..., p_{n,j}^t, ..., p_{n,J}^t} of the text vector x_n^t:

p_{n,j}^t = 1 / (1 + exp(-w_j^t · x_n^t))   (11)
Step 3.4: class label probability aggregation
Figure BDA0001987379750000063
Using the class label corresponding to the medium maximum probability as a text vector
Figure BDA0001987379750000064
Is marked as->
Figure BDA0001987379750000065
Thereby obtaining a data set D at the time t t The prediction class label set ≥ of all text vectors in>
Figure BDA0001987379750000066
Step 3.5: the classifier Model at time t is calculated by equation (12) t For the data block D at the current time t t Accuracy of prediction (Accuracy) t
Figure BDA0001987379750000067
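A short sketch of the per-label prediction of steps 3.3 to 3.5, using equation (11) for the probability and equation (12) for the block accuracy (names are illustrative):

```python
import numpy as np

def predict_block(block_vectors, weights, labels):
    """Score every text vector against every per-label LR classifier (eq. (11))
    and keep the label with the highest probability (step 3.4)."""
    preds, probs = [], []
    for x in block_vectors:
        p = {lab: 1.0 / (1.0 + np.exp(-np.dot(w, x))) for lab, w in zip(labels, weights)}
        best = max(p, key=p.get)
        preds.append(best)
        probs.append(p)
    return preds, probs

def accuracy(preds, true_labels):
    """Equation (12): fraction of texts in the block predicted correctly."""
    return sum(p == y for p, y in zip(preds, true_labels)) / len(true_labels)
```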
Step 3.6: according to the processing method in the step 2.4, the aggregated data block D at the time t is obtained t
Step 3.7: defining the total iteration number as Inter if Accuracy t Less than a threshold λ, inter = num will be initialized; num is the set drift weight; otherwise, initializing Inter =1;
step 3.8: defining the current iteration number as an inter, and initializing the inter =1;
Step 3.9: obtain the parameter vectors NP_{t+1,iter} and ZP_{t+1,iter} of the iter-th iteration at time t+1 according to the processing method of steps 2.6 to 2.11;
Step 3.10: assign NP_{t+1,iter} and ZP_{t+1,iter} to NP_t and ZP_t respectively;
Step 3.11: assign iter+1 to iter and return to step 3.9 until iter is greater than Iter, thereby obtaining the j-th parameter vectors NP_j^{t+1,Iter} and ZP_j^{t+1,Iter} of the Iter-th iteration at time t+1, which are assigned to the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 3.12: traversing the t +1 moment parameter vector set NP of the set t+1 And ZP t+1 And initializing j =1;
step 3.13: calculating the jth characteristic weight at the t +1 moment by using the formula (10)
Figure BDA00019873797500000610
Step 3.14: if the jth classifier at the moment of t +1
Figure BDA00019873797500000611
Will->
Figure BDA00019873797500000612
Is assigned to
Figure BDA00019873797500000613
Thereby updating the classifier Model at time t +1 t+1
If the jth classifier at the moment of t +1
Figure BDA0001987379750000071
Then the jth characteristic weight is used at the moment t + 1->
Figure BDA0001987379750000072
Construction of a Linear LR classifier->
Figure BDA0001987379750000073
Will then->
Figure BDA0001987379750000074
Classifier Model added at time t +1 t+1 Performing the following steps;
Step 3.15: assign j+1 to j and return to step 3.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, M_2^{t+1}, ..., M_J^{t+1}} at time t+1;
Step 3.16: assign t+1 to t and return to step 3.2 until t is greater than T, where T is the end time of the data stream;
Step 4: predict the test data stream, acquire the new-word sample set S of texts containing words that do not appear in the word set Vocab, and construct a word vector model C_e and a word vector expansion set Vec-expand;
Step 4.1: define the word vector expansion set Vec-expand and initialize it to the empty set, and initialize t = 0;
Step 4.2: obtain the test data block D'_t = {d'_{1,t}, d'_{2,t}, ..., d'_{n,t}, ..., d'_{N,t}} at time t in the test data stream, where d'_{n,t} is the n-th test text of the test data block D'_t at time t and w'_{i,n,t} is the i-th word of the n-th test text d'_{n,t} at time t;
Define the new-word text set S_t at time t and initialize it to the empty set; then traverse all test texts of the test data block D'_t at time t: if all words of the n-th test text d'_{n,t} at time t belong to the word set Vocab, do not add the n-th test text d'_{n,t} to the new-word text set S_t; otherwise add the n-th test text d'_{n,t} to the new-word text set S_t; thereby obtain the updated new-word text set S'_t at time t and the corresponding new-word set Vocab'_t;
Step 4.3: if the new word set at the time t is the text set S t If the total number of the test texts in the text set is greater than the threshold value, the updated new word text set S 'is processed by the processing method in the step 1' t Building a word2vec model and obtaining a new word vector set
Figure BDA00019873797500000712
Wherein Vec * (w′ y,t ) Representing a new word set Vocab 'at time t' t The y-th word w' y,t The word vector of (2); y =1,2.. Y, Y is a set of new words Vocab' t The total number of Chinese words; otherwise, executing step 5;
Step 4.4: traverse the new word vector set Vec*_t; if Vec*(w'_{y,t}) ∈ Vec-expand, assign Vec*(w'_{y,t}) to Vec*(w'_{y,t+1}); otherwise add Vec*(w'_{y,t}) to the word vector expansion set Vec-expand;
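A compact sketch of the new-word handling of steps 4.2 to 4.4; train_word2vec stands for any routine that applies the step 1 procedure (for example the gensim sketch above) and is an assumed helper, not part of the patent:

```python
def update_expansion(test_block, vocab, vec_expand, train_word2vec, min_texts=1):
    """Collect test texts containing out-of-vocabulary words, train a word2vec
    model on them when there are enough, and merge the resulting word vectors
    into the expansion set Vec-expand (overwriting existing entries)."""
    new_texts = [text for text in test_block
                 if any(w not in vocab for w in text)]
    if len(new_texts) > min_texts:
        new_vectors = train_word2vec(new_texts)
        for word, vector in new_vectors.items():
            vec_expand[word] = vector   # add the vector, or refresh it if present
    return vec_expand
```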
Step 5: obtain the class label strength of the test data block D'_t at time t:
Step 5.1: obtain the class label probability set Pro'_t and the predicted class label set Lab'_t = {lab'_{1,t}, ..., lab'_{n,t}, ..., lab'_{N,t}} of all test text vectors in the test data block D'_t at time t using the processing methods of steps 3.2 to 3.4, and combine them into {(lab'_{n,t}, pro'_{n,t}) | n = 1, 2, ..., N}, where pro'_{n,t} is the probability that the predicted class label of the n-th test text d'_{n,t} of the test data block D'_t at time t is lab'_{n,t};
And step 5.2: acquisition type label probability set Pro' t Fractional set of all class labels in
Figure BDA0001987379750000088
Wherein->
Figure BDA0001987379750000089
Is a class label probability set Pro' t A medium prediction class label is->
Figure BDA00019873797500000810
Test text number of (2) divided by class tag probability set Pro' t The ratio of the total test text number in the test text;
acquiring class label probability set Pro' t Maximum probability set max _ Pro of all predicted class tags in t ={max_pro 1 ,max_pro 2 ...max_pro j ...max_pro J Where max _ pro j Is a class label probability set Pro' t The middle class label is
Figure BDA00019873797500000811
The corresponding maximum probability;
Step 5.3: take Present_t and max_Pro_t as the class label strength topic_t of the test data block D'_t;
Step 5.4: assign t+1 to t and return to step 4.2 until t is greater than T', where T' is the end time of the test data stream, thereby obtaining the class label strength set {topic_1, topic_2, ..., topic_t, ..., topic_{T'}} of the test data stream.
Compared with the prior art, the invention has the following beneficial effects:
1. The method takes into account that short texts are short and carry little information: it expands the short text by building a neural-network word2Vec word vector model with the help of an external corpus, maps each word to a vector of a specified dimension, and obtains the semantic relatedness between words, which reduces the sparsity problem of short text and improves the accuracy of short text classification. At the same time, considering that new words appear while a short text data stream is processed, a word vector expansion library Vec-expand is designed and an incremental word2Vec model is built for the new words. Introducing the word vector expansion library reduces the dependence on the external corpus and improves the stability of the algorithm, solving the problem that text vectorization depends too heavily on the corpus.
2. The invention designs a classifier group in a distributed manner, which is easy to use and highly portable, and constructs a fast incremental linear LR classifier group by means of the FTRL algorithm, making it suitable for fast training on data streams and for problems such as imbalanced class label distribution.
3. The invention aims to capture how the class label strength of the data stream changes over time. The predicted class label and the maximum probability of each predicted data block are acquired to judge the change of topic strength in the unknown data stream; the accuracy of the global classifier group is acquired for each training data block, and if the accuracy is lower than a specified threshold the data block is trained repeatedly to improve the precision of the classifier.
4. The practical application fields of the invention are as follows: online short text processing, organizing huge amounts of unstructured short text data into standard text data; online modeling and automatic classification of high-speed short text data streams on the Internet; visualization of the topic distribution of online short text data streams; and public opinion monitoring, for obtaining the change curve of a topic's popularity among users over the whole platform, recommendation, and so on; the applications are wide.
Drawings
FIG. 1 is a block diagram of the short text data stream classification of the present invention;
FIG. 2 is a diagram illustrating extended word vector training in accordance with the present invention;
FIG. 3 is a diagram illustrating an updating process of the incremental classifier according to the present invention.
Detailed Description
In this embodiment, as shown in FIG. 1, a distributed fast incremental classification method for short text data streams based on a word2vec model is mainly used to solve the problem of classifying high-speed short text data streams whose class label distribution changes continuously. A Chinese data stream is first segmented into words; an English data stream directly proceeds with the following steps:
Step 1: construct the word2vec model as per FIG. 2:
Step 1.1: obtain an external text corpus from a knowledge base, denoted C_1 = {d_1, d_2, ..., d_m, ..., d_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C_1 and d_m represents the m-th text, with d_m = {w_1^m, ..., w_i^m, ..., w_I^m}, i = 1, 2, ..., I, where I denotes the number of words in the m-th text d_m, w_i^m denotes the i-th word of the m-th text d_m, and w_i^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C_1, and Vocab = {w_1, w_2, ..., w_z, ..., w_Z}, where Z represents the total number of words in the word set Vocab and w_z represents the z-th word in the word set Vocab; the word vector of the z-th word w_z is denoted Vec(w_z), giving the word vector set Vec = {Vec(w_1), ..., Vec(w_z), ..., Vec(w_Z)}, which is randomly initialized;
Step 1.2: here Vocab is the set of all words (i.e. the word list) of the external corpus, and Vec is an all-zero array of size |Vocab| multiplied by the feature dimension, i.e. the vector of the z-th word in Vocab is Vec(w_z) = Vec[z × feature dimension : (z+1) × feature dimension];
Step 1.3: count the word frequency of every word of the word set Vocab in the external corpus C_1; delete the words whose frequency is less than or equal to the lower frequency limit F_1 or greater than or equal to the upper frequency limit F_2, obtaining a filtered word frequency set; sort the filtered word frequency set in descending order, and construct a Huffman tree T over the sorted word frequency set;
Step 1.4: initialize m = 0;
Step 1.5: set the window size to window, traverse the m-th text d_m of the external corpus C_1, and obtain the current i-th word w_i^m and its context Context(w_i^m); the conditional probability p(Context(w_i^m) | w_i^m) of predicting the context Context(w_i^m) when the current word w_i^m is known is calculated by equation (1):

p(Context(w_i^m) | w_i^m) = ∏_{u ∈ Context(w_i^m)} p(u | w_i^m)   (1)

In equation (1), u represents any word in the context Context(w_i^m), and p(u | w_i^m) is the conditional probability of predicting the word u when the current word w_i^m is known, with:

p(u | w_i^m) = ∏_{h=1}^{H} p(d_h | Vec(w_i^m), θ_h)   (2)

In equation (2), H is the height of the Huffman tree T, d_h is the h-th digit of the Huffman code, corresponding to the h-th node in the path from the root node to the leaf node of the word u, and θ_h is the parameter corresponding to the h-th node in that path; the probability p(d_h | Vec(w_i^m), θ_h) that the word vector Vec(w_i^m) takes the h-th digit d_h under the parameter θ_h is calculated using equation (3):

p(d_h | Vec(w_i^m), θ_h) = [σ(Vec(w_i^m)·θ_h)]^(1-d_h) · [1 - σ(Vec(w_i^m)·θ_h)]^(d_h)   (3)

In equation (3), σ(·) denotes the sigmoid activation function;
Step 1.6: calculate the maximum likelihood function ζ of the probability p(Context(w_i^m) | w_i^m) using equation (4):

ζ = ∑_{d_m ∈ C_1} ∑_{w_i^m ∈ d_m} log p(Context(w_i^m) | w_i^m)   (4)

Step 1.7: take the partial derivatives of the maximum likelihood function ζ with respect to the parameter θ_h and the word vector Vec(w_i^m) to obtain the corresponding gradients, which are used to update the parameter θ_h and the word vector Vec(w_i^m), giving the updated parameter θ'_h and the updated word vector Vec'(w_i^m); the updated word vector set Vec' is thus obtained as the word2Vec model and assigned to the original word vector set Vec;
Step 1.8: assign m+1 to m and return to step 1.5 until m is greater than M, so as to obtain the final word vector set Vec.
In this embodiment, a word2vec model containing word vectors as shown in Table 1 is trained from the external corpus, where the feature dimension = 50 and the size of Vocab = 171678.
Table 1: word2vec word vectors
Step 2: construct the dynamic linear LR classifier group Model and the global class label set L according to FIG. 3:
Step 2.1: define a time t and initialize t = 0;
Step 2.2: acquire the data block D_t = {(d_n^t, y_n^t) | n = 1, 2, ..., N} at time t in the data stream, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text of the data block D_t at time t, and y_n^t is the class label of the n-th text of the data block D_t at time t; let L_t denote the class label set of the data block D_t at time t, with L_t = {l_1^t, ..., l_j^t, ..., l_J^t}, where l_j^t is the j-th class label of the class label set L_t at time t, j = 1, 2, ..., J, and J is the total number of different class labels in the data block D_t at time t;
Step 2.3: put the class label set L_t into the global class label set L; then traverse the n-th text d_n^t, obtain the word vector of each of its words, compute their average value, and take the average value as the text vector x_n^t of the n-th text d_n^t at time t; thereby obtain the text vectors of all texts in the data block D_t at time t and replace each text in the data block D_t at time t with its text vector, obtaining the replaced data block D_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} at time t;
Step 2.4: aggregate the replaced data block D_t at time t according to the class label set L_t at time t, obtaining the aggregated data block D_t = {D_1^t, ..., D_j^t, ..., D_J^t} at time t, where D_j^t is the set of text vectors in the aggregated data block D_t at time t whose class label is l_j^t;
Step 2.5: create and randomly initialize the j-th parameter vectors NP_j^t and ZP_j^t at time t, thereby obtaining the parameter vector sets NP_t = {NP_1^t, ..., NP_J^t} and ZP_t = {ZP_1^t, ..., ZP_J^t} at time t;
Step 2.6: initialize j = 1;
Step 2.7: in the aggregated data block D_t at time t, assign the class label 1 to the text vector set D_j^t whose class label is l_j^t and take it as the positive training samples; assign the class label 0 to the text vector sets whose class label is not l_j^t and take them as the negative training samples; construct the training text vector set Train_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} from the positive and negative training samples, where y_n^t ∈ {0, 1};
Step 2.8: when t > 0, perform step 2.10, otherwise perform step 2.9;
Step 2.9: randomly initialize the feature weight w_j^t of the classifier whose class label is l_j^t;
Step 2.10: traverse the training text vector set Train_t, and update the j-th parameter vectors NP_j^t and ZP_j^t at time t to the j-th parameter vectors NP_j^{t+1} and ZP_j^{t+1} at time t+1 according to equations (5) and (6):

ZP_j^{t+1} = ZP_j^t + g_n^t - σ_n^t · w_j^t   (5)
NP_j^{t+1} = NP_j^t + (g_n^t)^2   (6)

In equations (5) and (6), σ_n^t is the optimization iteration parameter factor of the n-th text vector x_n^t at time t of the text vector set D_j^t, and g_n^t is the gradient of the loss function of the n-th text vector x_n^t at time t of the text vector set D_j^t with respect to the feature weight w_j^t, with:

σ_n^t = (1/α) · (sqrt(NP_j^t + (g_n^t)^2) - sqrt(NP_j^t))   (7)
g_n^t = (p_n^t - y_n^t) · x_n^t   (8)
p_n^t = 1 / (1 + exp(-w_j^t · x_n^t))   (9)

In equation (8), p_n^t is the predicted probability of the text vector x_n^t at time t, given by equation (9);
Step 2.11: assign j+1 to j and return to step 2.8 until j is greater than J, thereby obtaining the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 2.12: traverse the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1, and initialize j = 1;
Step 2.13: calculate the j-th feature weight w_j^{t+1} at time t+1 using equation (10):

w_j^{t+1} = 0, if |ZP_j^{t+1}| ≤ λ_1;
w_j^{t+1} = -((β + sqrt(NP_j^{t+1}))/α + λ_2)^(-1) · (ZP_j^{t+1} - sgn(ZP_j^{t+1})·λ_1), otherwise   (10)

In equation (10), λ_1, λ_2, α and β are all hyper-parameters, and sgn(·) is the sign function;
Step 2.14: use the j-th feature weight w_j^{t+1} at time t+1 to construct a linear LR classifier M_j^{t+1}, and add the j-th classifier M_j^{t+1} at time t+1 to the dynamic linear LR classifier group Model_{t+1} at time t+1;
Step 2.15: assign j+1 to j and return to step 2.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, M_2^{t+1}, ..., M_J^{t+1}} at time t+1;
Step 3: incrementally update the global class label set L, and modify and extend the classifier group Model_{t+1} at time t+1:
Step 3.1: initialize t = 1;
Step 3.2: obtain the replaced data block D_t at time t according to the processing method of steps 2.2 to 2.3;
Step 3.3: traverse the text vector x_n^t of the n-th text d_n^t of the data block D_t at time t, and obtain the probability p_{n,j}^t that the text vector x_n^t has the j-th class label l_j^t using equation (11), thereby obtaining the class label probability set Pro_n^t = {p_{n,1}^t, ..., p_{n,j}^t, ..., p_{n,J}^t} of the text vector x_n^t:

p_{n,j}^t = 1 / (1 + exp(-w_j^t · x_n^t))   (11)

Step 3.4: take the class label corresponding to the maximum probability in the class label probability set Pro_n^t as the predicted class label lab_n^t of the text vector x_n^t, thereby obtaining the predicted class label set Lab_t = {lab_1^t, ..., lab_n^t, ..., lab_N^t} of all text vectors in the data block D_t at time t;
Step 3.5: calculate the prediction accuracy Accuracy_t of the classifier group Model_t at time t on the data block D_t at the current time t by equation (12):

Accuracy_t = |{n : lab_n^t = y_n^t, n = 1, ..., N}| / N   (12)

Step 3.6: obtain the aggregated data block D_t at time t according to the processing method of step 2.4;
Step 3.7: define the total number of iterations as Iter; if Accuracy_t is less than the threshold λ, initialize Iter = num, where num is the set drift weight; otherwise initialize Iter = 1;
Step 3.8: define the current iteration number as iter, and initialize iter = 1;
Step 3.9: obtain the parameter vectors NP_{t+1,iter} and ZP_{t+1,iter} of the iter-th iteration at time t+1 according to the processing method of steps 2.6 to 2.11;
Step 3.10: assign NP_{t+1,iter} and ZP_{t+1,iter} to NP_t and ZP_t respectively;
Step 3.11: assign iter+1 to iter and return to step 3.9 until iter is greater than Iter, thereby obtaining the j-th parameter vectors NP_j^{t+1,Iter} and ZP_j^{t+1,Iter} of the Iter-th iteration at time t+1, which are assigned to the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 3.12: traverse the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1, and initialize j = 1;
Step 3.13: calculate the j-th feature weight w_j^{t+1} at time t+1 using equation (10);
Step 3.14: if the j-th classifier M_j^{t+1} at time t+1 already exists in the classifier group Model_{t+1}, assign the j-th feature weight w_j^{t+1} at time t+1 to M_j^{t+1}, thereby updating the classifier group Model_{t+1} at time t+1; if the j-th classifier M_j^{t+1} at time t+1 does not exist in the classifier group Model_{t+1}, construct a linear LR classifier M_j^{t+1} with the j-th feature weight w_j^{t+1} at time t+1 and add M_j^{t+1} to the classifier group Model_{t+1} at time t+1;
Step 3.15: assign j+1 to j and return to step 3.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, ..., M_J^{t+1}} at time t+1;
Step 3.16: assign t+1 to t and return to step 3.2 until t is greater than T, where T is the end time of the data stream;
Step 4: predict the test data stream, acquire the new-word sample set S of texts containing words that do not appear in the word set Vocab, and construct a word vector model C_e and a word vector expansion set Vec-expand;
Step 4.1: define the word vector expansion set Vec-expand and initialize it to the empty set, and initialize t = 0;
Step 4.2: obtain the test data block D'_t = {d'_{1,t}, ..., d'_{n,t}, ..., d'_{N,t}} at time t in the test data stream, where d'_{n,t} is the n-th test text of the test data block D'_t at time t and w'_{i,n,t} is the i-th word of the n-th test text d'_{n,t} at time t; define the new-word text set S_t at time t and initialize it to the empty set; then traverse all test texts of the test data block D'_t at time t: if all words of the n-th test text d'_{n,t} at time t belong to the word set Vocab, do not add the n-th test text d'_{n,t} to the new-word text set S_t; otherwise add the n-th test text d'_{n,t} to the new-word text set S_t; thereby obtain the updated new-word text set S'_t at time t and the corresponding new-word set Vocab'_t;
Step 4.3: if the total number of test texts in the new-word text set S_t at time t is greater than a threshold, build a word2vec model on the updated new-word text set S'_t with the processing method of step 1 and obtain the new word vector set Vec*_t = {Vec*(w'_{1,t}), ..., Vec*(w'_{y,t}), ..., Vec*(w'_{Y,t})}, where Vec*(w'_{y,t}) is the word vector of the y-th word w'_{y,t} of the new-word set Vocab'_t at time t, y = 1, 2, ..., Y, and Y is the total number of words in the new-word set Vocab'_t; otherwise, execute step 5;
Step 4.4: traverse the new word vector set Vec*_t; if Vec*(w'_{y,t}) ∈ Vec-expand, assign Vec*(w'_{y,t}) to Vec*(w'_{y,t+1}); otherwise add Vec*(w'_{y,t}) to the word vector expansion set Vec-expand;
Step 5: obtain the class label strength of the test data block D'_t at time t:
Step 5.1: obtain the class label probability set Pro'_t and the predicted class label set Lab'_t = {lab'_{1,t}, ..., lab'_{n,t}, ..., lab'_{N,t}} of all test text vectors in the test data block D'_t at time t using the processing methods of steps 3.2 to 3.4, and combine them into {(lab'_{n,t}, pro'_{n,t}) | n = 1, 2, ..., N}, where pro'_{n,t} is the probability that the predicted class label of the n-th test text d'_{n,t} of the test data block D'_t at time t is lab'_{n,t};
Step 5.2: obtain the proportion set Present_t = {present_1, ..., present_j, ..., present_J} of all class labels in the class label probability set Pro'_t, where present_j is the number of test texts in the class label probability set Pro'_t whose predicted class label is l_j divided by the total number of test texts in Pro'_t; obtain the maximum probability set max_Pro_t = {max_pro_1, ..., max_pro_j, ..., max_pro_J} of all predicted class labels in the class label probability set Pro'_t, where max_pro_j is the maximum probability corresponding to the class label l_j in the class label probability set Pro'_t;
Step 5.3: take Present_t and max_Pro_t as the class label strength topic_t of the test data block D'_t;
Step 5.4: assign t+1 to t and return to step 4.2 until t is greater than T', where T' is the end time of the test data stream, thereby obtaining the class label strength set {topic_1, topic_2, ..., topic_t, ..., topic_{T'}} of the test data stream.
Example: according to the above classification method for short text data streams, as shown in FIG. 1, the method specifically comprises the following steps:
(1) Obtain the data block D_t at time t of the data stream, where the data are as shown in Table 2;
Table 2: 15 sample data of the short text data stream
(2) For the first training data block D_1, initialize L = { } and Model = { };
(3) Obtain the class label set L_1 = {Health, Sci_tech, Entertainment} of D_1, and add L_1 to the global variable L, i.e. L = {Health, Sci_tech, Entertainment};
(4) Convert each text into a text vector based on the word vector set Vec (see Table 1), i.e. use the word vector model to convert D_1 into Table 3;
Table 3: vectorized data blocks
(5) Aggregate D_1 according to its class labels;
(6) Traverse the aggregated groups and initialize NP, ZP and the corresponding feature weights;
(7) Traverse the sets NP_0 and ZP_0. For NP_1^0 and ZP_1^0, randomly initialize the feature weight, take the text vector group of the corresponding class label as the positive training set and the remaining groups as the negative training set, and perform an FTRL update; after the traversal, write the new NP_1^0 and ZP_1^0 back into NP and ZP. Repeat the same procedure for the remaining class labels, updating NP and ZP;
(8) Traverse NP and ZP from (7), calculate the weights with formula (10), where λ_1 = 0.01, λ_2 = 0.01, α = 0.01 and β = 0.01, and construct a linear classifier with each weight, thereby obtaining model_1, model_2 and model_3, which are added to the global classifier group Model;
(9) For the other data blocks D_t, e.g. D_2, vectorize D_2 according to the processing method of step (4), and obtain the class label set L_2 = {Us, Sport, World, Health, Business} of D_2. Since L = {Health, Sci_tech, Entertainment} and Us, Sport, World, Business ∉ L, add Us, Sport, World and Business to L, i.e. L = {Health, Sci_tech, Entertainment, Us, Sport, World, Business};
(10) Traverse D_t, e.g. D_2: for each text vector, use Model = {M_1, M_2, M_3} to calculate the probability of each class label; the results are shown in Table 4.
Table 4: data block prediction results
Obtain the most likely class label of each text of D_2, Lab = {Entertainment, Sci_tech, Entertainment, Health, Entertainment}, and the class label probability set of D_2, pro = {0.5, 0.6, 0.4, 0.8, 0.6}; the calculated accuracy = 20%.
(11) Aggregate D_2 according to L;
(12) Defining an initial total iteration number Iter, since accuracy =20% <50%, i.e. initialization Iter =3;
(13) Obtain the NP and ZP sets of the previous data block, and update the NP and ZP parameter sets after iterating step (7) Iter times;
(14) Calculate the feature weight of the classifier corresponding to each class label using the parameter sets NP and ZP;
(15) After step (8), only the classifiers corresponding to the labels Health, Sci_tech and Entertainment are in the classifier group Model. Add the LR classifiers constructed from the feature weights corresponding to Us, Sport, World and Business to Model, and update the classifier weight corresponding to Health;
(16) Assigning t +1 to t, and returning to the step (9) until the data flow is finished;
(17) Define the new-word text set S and the word vector expansion library Vec-expand and initialize them to empty sets.
(18) For a data block D_t at time t of the test data stream, e.g. D_n, first traverse D_n; if a test text contains a word that does not belong to Vocab, put that text into S. If |S| > 0, use the processing method of step 1 to obtain the word2Vec model of S and put its word vectors into the Vec-expand expansion library; if Vec-expand already contains a vector for the current word, it is updated. Here Vec-expand = { }, so the word vectors of S are put into Vec-expand, i.e. Vec-expand = {Vec(mariah), Vec(carey), Vec(join), Vec(simon), Vec(cowells), Vec(factor)};
(19) Vectorize D_n using the processing method of step (4); using the processing method of step (10), obtain the predicted class label set Lab and the class label probability set Pro of D_n, where Lab = {Business, Us, Business, Entertainment, ...} and Pro = {0.9, 0.7, 0.9, 0.5, 0.7};
(20) From the predicted Pro, obtain the proportion present of each class label in D_t and the corresponding maximum probability set max_pro. For the above example D_n: present = {Business: 0.4, Us: ...}; max_pro = {Business: 0.9, Us: ...};
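As a quick check, the class_label_strength sketch from step 5 reproduces the values quoted in step (20); the fifth predicted label of D_n is not legible in the original text, so the value used below is only a placeholder:

```python
pred_labels = ["Business", "Us", "Business", "Entertainment", "World"]  # fifth label assumed
pred_probs = [0.9, 0.7, 0.9, 0.5, 0.7]
present, max_pro = class_label_strength(pred_labels, pred_probs)
# present["Business"] == 0.4 and max_pro["Business"] == 0.9, matching step (20)
```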
(21) And assigning t +1 to t, and returning to the step (18) until the test data flow is ended.

Claims (1)

1. A fast incremental classification method of short text data streams based on a word2vec model is characterized by comprising the following steps:
step 1: constructing a word2vec model:
step 1.1: obtaining an external corpus of text from a knowledge base, denoted C 1 ={d 1 ,d 2 ...d m ...d M M =1,2.. M, where M represents the text external corpus C 1 Total number of texts of d m Represents the m-th text and has
Figure FDA0001987379740000011
I denotes the mth text d m The number of the Chinese words is equal to the number of the Chinese words,
Figure FDA0001987379740000012
representing said mth text d m The ith word in, and
Figure FDA0001987379740000013
vocab represents an external corpus of text C 1 Word set of all different words in (1), and Vocab = { w = { (w) 1 ,w 2 ....w z ....w Z Z, Z representing the total number of words in the set of words Vocab, w z Represents the z-th word in the word set Vocab and leads the z-th word w z The word vector of (c) is denoted as Vec (w) z ) Thus, for the set of word vectors Vec = { Vec (w) 1 ),Vec(w 2 )…Vec(w z )…Vec(w Z ) Fifthly, randomly initializing the word vector set Vec;
step 1.2: counting all words in the word set Vocab in the text external corpus C 1 The number of occurrences in (1), i.e. word frequency, is less than or equal to the lower frequency limit F 1 Or greater than or equal to the upper frequency limit F 2 Deleting the word frequency to obtain a filtered word frequency set; sorting the filtered word frequency sets in a descending order, and constructing a Huffman tree T for the sorted word frequency sets;
step 1.3: setting the window size to window, and traversing the text external corpus C 1 M-th text d in m And obtains each word of the current wordThe ith word
Figure FDA0001987379740000014
In the context of
Figure FDA0001987379740000015
The known current word is calculated by equation (1) as
Figure FDA0001987379740000016
Predict its context
Figure FDA0001987379740000017
Conditional probability of (2)
Figure FDA0001987379740000018
Figure FDA0001987379740000019
In the formula (1), u represents a context
Figure FDA00019873797400000110
Any one of the words in any one of the above,
Figure FDA00019873797400000111
indicating that the current word is known as
Figure FDA00019873797400000112
Then, the conditional probability of the word u is predicted, and the following are provided:
Figure FDA00019873797400000113
in the formula (2), H is the height of Huffman tree T, d h Is that the h-th leaf node in the path from the root node to the leaf node corresponding to the word u corresponds to the Huffman codeThe h-th digit of (1); theta h A parameter corresponding to the h-th leaf node in a path from the root node to the leaf node corresponding to the word u;
calculating the ith word using equation (3)
Figure FDA0001987379740000021
Word vector of
Figure FDA0001987379740000022
At the parameter theta h By the presence of the h-th digit d h Probability of (2)
Figure FDA0001987379740000023
Figure FDA0001987379740000024
In formula (3), σ (·) represents an excitation function;
step 1.4: calculating probability using equation (4)
Figure FDA0001987379740000025
Maximum likelihood function of
Figure FDA0001987379740000026
Figure FDA0001987379740000027
Step 1.5: for maximum likelihood function
Figure FDA0001987379740000028
Parameter θ of h Sum word vector
Figure FDA0001987379740000029
Respectively calculating partial derivatives to obtain corresponding gradients for updating the parameter theta h Sum word vector
Figure FDA00019873797400000210
To obtain an updated parameter theta' j Sum word vector
Figure FDA00019873797400000211
Further obtaining an updated word vector set Vec 'as a word2Vec model and assigning the updated word vector set Vec' to the original word vector set Vec;
step 2: constructing a dynamic linear LR classifier Model and a global class label set L:
step 2.1: defining a time t, and initializing t =0;
step 2.2: acquiring a data block at time t in any data stream
Figure FDA00019873797400000212
Wherein N is a data block D at time t t The total number of texts in (a),
Figure FDA00019873797400000213
for a data block D at time t t The (n) th text of the (n),
Figure FDA00019873797400000214
for a data block D at time t t Class label of the nth text in order of L t Representing a block of data D at time t t And a class label set of (c) and having:
Figure FDA00019873797400000215
Figure FDA00019873797400000216
for time t class label set L t J =1,2.. J, which is a data block D at time t t The total number of different types of labels in the label list;
step 2.3 class Label set L t Put into the global class label set L and then traverse the nth text
Figure FDA00019873797400000217
Obtaining a word vector of each word, then obtaining an average value, and using the average value as the nth text at the time t
Figure FDA00019873797400000218
Text vector of
Figure FDA00019873797400000219
Thereby obtaining a data block D at the time t t The text vectors of all texts in the text table are replaced, and the data block D at the time t is replaced t Obtaining the replaced t-time data block from each text
Figure FDA00019873797400000220
Step 2.4: class label set L according to time t t The replaced data block D at the time t t Polymerizing to obtain a data block at the t time after polymerization
Figure FDA00019873797400000221
Wherein,
Figure FDA00019873797400000222
for aggregated data block D at time t t Wherein all classes are labeled as
Figure FDA00019873797400000223
A set of text vectors of;
step 2.5: creating and randomly initializing the jth parameter vector at time t
Figure FDA0001987379740000031
And
Figure FDA0001987379740000032
thereby obtaining a parameter vector set at the time t
Figure FDA0001987379740000033
And
Figure FDA0001987379740000034
step 2.6: initializing j =1;
step 2.7: the aggregated data block D at the time t t Wherein all classes are labeled as
Figure FDA0001987379740000035
Set of text vectors
Figure FDA0001987379740000036
The value of class label of (1) is assigned as a positive training sample, and the rest class labels are not
Figure FDA0001987379740000037
Assigning a class label corresponding to the text vector set to 0, and using the class label as a negative training sample;
constructing a training text vector set by the positive training sample and the negative training sample
Figure FDA0001987379740000038
Wherein
Figure FDA0001987379740000039
Step 2.8: when t >0, perform step 2.10, otherwise perform step 2.9;
step 2.9: random initialization class labels as
Figure FDA00019873797400000310
Characteristic weight of
Figure FDA00019873797400000311
Step 2.10: trainting training text vector set Train t And the jth parameter vector at the t moment is calculated according to the equations (5) and (6)
Figure FDA00019873797400000312
And
Figure FDA00019873797400000313
updating to the jth parameter vector at the moment of t +1
Figure FDA00019873797400000314
And
Figure FDA00019873797400000315
Figure FDA00019873797400000316
Figure FDA00019873797400000317
in the formulae (5) and (6),
Figure FDA00019873797400000318
for a set of text vectors
Figure FDA00019873797400000319
Nth text vector at middle t moment
Figure FDA00019873797400000320
The optimization of the iteration parameter factor of (2),
Figure FDA00019873797400000321
for a set of text vectors
Figure FDA00019873797400000322
Nth text vector at middle t moment
Figure FDA00019873797400000323
Loss function of
Figure FDA00019873797400000324
For feature weight
Figure FDA00019873797400000325
And has:
Figure FDA00019873797400000326
Figure FDA00019873797400000327
Figure FDA00019873797400000328
in the formula (8), the reaction mixture is,
Figure FDA0001987379740000041
for the text vector at the t moment
Figure FDA0001987379740000042
Is predicted probability of, and
Figure FDA0001987379740000043
step 2.11: j +1 is assigned to J, the step 2.8 is returned until J is larger than J, and therefore a parameter vector set NP at the moment of t +1 is obtained t+1 And ZP t+1
Step 2.12: traversing t +1 moment parameter vector set NP t+1 And ZP t+1 And initializing j =1;
step 2.13: calculating the jth characteristic weight at the t +1 moment by using the formula (10)
Figure FDA0001987379740000044
Figure FDA0001987379740000045
In formula (10), λ 1 、λ 2 Alpha and beta are both hyper-parameters, zgn (·) is an upward rounding function;
Step 2.14: constructing a linear LR classifier f_{t+1}^j using the jth feature weight W_{t+1}^j at time t+1, and adding the jth classifier f_{t+1}^j at time t+1 to the dynamic linear LR classifier group Model_{t+1} at time t+1;
Step 2.15: assigning j+1 to j, and returning to step 2.13 until j is larger than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {f_{t+1}^1, f_{t+1}^2, ..., f_{t+1}^j, ..., f_{t+1}^J} at time t+1;
Step 3: incrementally updating the global class label set L, and modifying and adding to the classifier group Model_{t+1} at time t+1:
Step 3.1: initializing t =1;
Step 3.2: obtaining the replaced data block D_t at time t according to the processing method of steps 2.2 to 2.3;
Step 3.3: traverse the data block D at time t t The nth text
Figure FDA00019873797400000410
Text vector of
Figure FDA00019873797400000411
And obtains a text vector using equation (11)
Figure FDA00019873797400000412
Class jth tag of
Figure FDA00019873797400000413
Probability of (2)
Figure FDA00019873797400000414
Thereby obtaining a text vector
Figure FDA00019873797400000415
Class label probability set of
Figure FDA00019873797400000416
Figure FDA00019873797400000417
Step 3.4: class label probability aggregation
Figure FDA00019873797400000418
Using the class label corresponding to the medium maximum probability as a text vector
Figure FDA00019873797400000419
Is marked as
Figure FDA00019873797400000420
Thereby obtaining a data set D at the time t t Prediction class label set of all text vectors in
Figure FDA00019873797400000421
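A Python sketch of the prediction in steps 3.3 and 3.4, assuming formula (11) is the logistic (sigmoid) score of each one-vs-rest LR classifier (this form is an assumption; names are illustrative):

import numpy as np

def predict_label(text_vector: np.ndarray, weights_per_label: dict) -> tuple:
    """Score a text vector with every one-vs-rest LR classifier and return
    the label with the highest probability together with the full
    class-label probability set (steps 3.3 and 3.4)."""
    probs = {label: 1.0 / (1.0 + np.exp(-np.dot(w, text_vector)))
             for label, w in weights_per_label.items()}
    best_label = max(probs, key=probs.get)
    return best_label, probs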
Step 3.5: the classifier Model at time t is calculated by equation (12) t In pair whenData block D at time t t Accuracy of prediction of (Accuracy) t
Figure FDA0001987379740000051
Step 3.6: according to the processing method in the step 2.4, the aggregated data block D at the time t is obtained t
Step 3.7: defining the total iteration number as Inter if Accuracy t Less than a threshold λ, inter = num will be initialized; num is the set drift weight; otherwise, initializing Inter =1;
step 3.8: defining the current iteration number as an inter, and initializing the inter =1;
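A small sketch of the drift-sensitive control flow in steps 3.7 and 3.8: when the block accuracy drops below the threshold, the model is retrained for more iterations, otherwise a single incremental pass is used (the numeric values in the usage comment are placeholders, not values from the patent):

def choose_iterations(accuracy_t: float, lam: float, num: int) -> int:
    """Steps 3.7 and 3.8: a low block accuracy is treated as a sign of
    concept drift and triggers `num` retraining iterations; otherwise
    one incremental pass is performed."""
    return num if accuracy_t < lam else 1

# usage sketch:
#   Iter = choose_iterations(accuracy_t, lam=..., num=...)
#   for inter in range(1, Iter + 1):
#       update NP_{t+1,inter} and ZP_{t+1,inter} as in steps 2.6 to 2.11,
#       then feed them back into NP_t and ZP_t (step 3.10)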
Step 3.9: obtaining the parameter vectors NP_{t+1,inter} and ZP_{t+1,inter} of the inter-th iteration at time t+1 according to the processing method of steps 2.6 to 2.11;
Step 3.10: assigning NP_{t+1,inter} and ZP_{t+1,inter} to NP_t and ZP_t, respectively;
Step 3.11: assigning Inter +1 to the Inter, and returning to the step 3.9 to execute until the Inter is greater than the Iter, thereby obtaining the jth parameter vector of the Inter iteration at the moment of t +1
Figure FDA0001987379740000052
And
Figure FDA0001987379740000053
and assigning to a parameter vector set NP at the moment t +1 t+1 And ZP t+1
Step 3.12: traversing the t +1 moment parameter vector set NP of the set t+1 And ZP t+1 And initializing j =1;
Step 3.13: calculating the jth feature weight W_{t+1}^j at time t+1 by using formula (10);
Step 3.14: if the jth classifier at the moment of t +1
Figure FDA0001987379740000055
Then will be
Figure FDA0001987379740000056
Is assigned to
Figure FDA0001987379740000057
Thereby updating the classifier Model at time t +1 t+1
If the jth classifier at the moment of t +1
Figure FDA0001987379740000058
Then the jth feature weight at time t +1 is used
Figure FDA0001987379740000059
Construction of Linear LR classifier
Figure FDA00019873797400000510
Then will be
Figure FDA00019873797400000511
Classifier Model added at time t +1 t+1 Performing the following steps;
Step 3.15: assigning j+1 to j, and returning to step 3.13 until j is larger than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {f_{t+1}^1, ..., f_{t+1}^j, ..., f_{t+1}^J} at time t+1;
Step 3.16: assigning T +1 to T, and returning to the step 3.2 until T is greater than T, wherein T is the ending moment of the data stream;
Step 4: predicting the test data stream, acquiring the new-word text set S whose words do not appear in the word set Vocab, and constructing a word vector model C_e and a word vector expansion set Vec-expand:
Step 4.1: defining the word vector expansion set Vec-expand, initializing it to an empty set, and initializing t = 0;
Step 4.2: obtaining the test data block D'_t at time t in the test data stream, wherein d'_{n,t} is the nth test text in the test data block D'_t at time t, and w'_{i,n,t} is the ith word in the nth test text d'_{n,t} at time t;
defining the new-word text set S_t at time t and initializing it to an empty set, and then traversing all test texts in the test data block D'_t at time t: if all the words in the nth test text d'_{n,t} at time t belong to the word set Vocab, the nth test text d'_{n,t} is not added to the new-word text set S_t; otherwise, the nth test text d'_{n,t} is added to the new-word text set S_t; thereby obtaining the updated new-word text set S'_t at time t and the corresponding new word set Vocab'_t;
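A Python sketch of the new-word screening in step 4.2: a test text enters the new-word text set as soon as it contains at least one word outside the known word set Vocab (names are illustrative):

def split_new_word_texts(test_texts, vocab):
    """Step 4.2: split a test data block into texts that contain
    out-of-vocabulary words (candidates for S_t) and fully covered texts."""
    new_word_texts, covered_texts = [], []
    for text in test_texts:                      # each text is a list of words
        if all(word in vocab for word in text):
            covered_texts.append(text)
        else:
            new_word_texts.append(text)
    return new_word_texts, covered_texts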
Step 4.3: if the total number of test texts in the new-word text set S_t at time t is greater than the threshold, building a word2vec model on the updated new-word text set S'_t according to the processing method of step 1, and obtaining the new-word vector set Vec* = {Vec*(w'_{1,t}), ..., Vec*(w'_{y,t}), ..., Vec*(w'_{Y,t})}, wherein Vec*(w'_{y,t}) represents the word vector of the yth word w'_{y,t} in the new word set Vocab'_t at time t, y = 1, 2, ..., Y, and Y is the total number of words in the new word set Vocab'_t; otherwise, executing step 5;
step 4.4: go through
Figure FDA0001987379740000069
If Vec * (w′ y,t ) E is the Vec-expand, then the Vec is added * (w′ y,t ) Assign to Vec * (w′ y,t+1 ) Otherwise Vec will be * (w′ y,t ) Adding the word vector expansion set Vec-expanded;
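A sketch of steps 4.3 and 4.4 using the gensim library (version 4 or later is assumed; the training hyper-parameters shown are illustrative defaults, not values from the patent):

from gensim.models import Word2Vec

def extend_word_vectors(new_word_texts, vec_expand: dict,
                        vector_size: int = 100) -> dict:
    """Train a word2vec model on the new-word texts S'_t and merge the
    resulting vectors into the expansion set Vec-expand, keeping the
    existing vector when a word is already present (steps 4.3 and 4.4)."""
    model = Word2Vec(sentences=new_word_texts, vector_size=vector_size,
                     window=5, min_count=1, sg=1)   # skip-gram, keep rare words
    for word in model.wv.index_to_key:
        if word not in vec_expand:                  # already-known vectors are kept
            vec_expand[word] = model.wv[word]
    return vec_expand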
Step 5: obtaining the class label strength of the test data block D'_t at time t:
Step 5.1: obtaining the class label probability set Pro'_t and the predicted class label set of all test text vectors in the test data block D'_t at time t by using the processing methods of steps 3.2 to 3.4, and combining them into Pro'_t = {pro'_{1,t}, ..., pro'_{n,t}, ..., pro'_{N,t}}, wherein pro'_{n,t} represents the probability with which the nth test text d'_{n,t} in the test data block D'_t at time t is assigned its predicted class label l'_{n,t};
Step 5.2: acquiring class label probability set Pro' t Fractional set of all class labels in
Figure FDA00019873797400000617
Wherein
Figure FDA00019873797400000618
Is a class label probability set Pro' t The medium prediction class label is
Figure FDA0001987379740000071
Test text number of (2) divided by class tag probability set Pro' t The ratio of the total test text number in the test text;
acquiring class label probability set Pro' t Maximum probability set max _ Pro of all predicted class tags in t ={max_pro 1 ,max_pro 2 ...max_pro j ...max_pro J Where max _ pro j Is a class label probability set Pro' t The middle class label is
Figure FDA0001987379740000072
The corresponding maximum probability;
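A Python sketch of the class label strength computation in step 5.2, assuming the predictions of one test block are available as (predicted label, probability) pairs (names are illustrative):

from collections import Counter

def class_label_strength(predictions):
    """Step 5.2: compute present_t (the share of test texts assigned to each
    label) and max_Pro_t (the highest probability observed for each label)."""
    total = len(predictions)
    counts, max_prob = Counter(), {}
    for label, prob in predictions:
        counts[label] += 1
        max_prob[label] = max(prob, max_prob.get(label, 0.0))
    present = {label: counts[label] / total for label in counts}
    return present, max_prob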
Step 5.3: taking present_t and max_Pro_t as the class label strength topic_t of the test data block D'_t;
Step 5.4: assigning T +1 to T, and returning to the step 4.2 until T is greater than T ', wherein T' is the end time of the test data stream; thereby obtaining the label-like strength set { topic) of the test data stream 1 ,topic 2 ...topic t ,...topic T′ }。
CN201910169255.3A 2019-03-06 2019-03-06 Quick incremental classification method for short text data stream based on word2vec model Active CN109918667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910169255.3A CN109918667B (en) 2019-03-06 2019-03-06 Quick incremental classification method for short text data stream based on word2vec model

Publications (2)

Publication Number Publication Date
CN109918667A CN109918667A (en) 2019-06-21
CN109918667B true CN109918667B (en) 2023-03-24

Family

ID=66963460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910169255.3A Active CN109918667B (en) 2019-03-06 2019-03-06 Quick incremental classification method for short text data stream based on word2vec model

Country Status (1)

Country Link
CN (1) CN109918667B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026846B (en) * 2019-12-09 2021-08-17 Hefei University of Technology Online short text data stream classification method based on feature extension

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824922A (en) * 2016-03-16 2016-08-03 Chongqing University of Posts and Telecommunications Emotion classifying method fusing intrinsic feature and shallow feature
CN106776534A (en) * 2016-11-11 2017-05-31 Beijing Technology and Business University The incremental learning method of term vector model
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof
CN107679228A (en) * 2017-10-23 2018-02-09 Hefei University of Technology A kind of short text data stream sorting technique based on short text extension and concept drift detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chinese short text classification method based on the word2vec word model; Gao Mingxia et al.; Journal of Shandong University (Engineering Science); 2018-11-02 (No. 02); full text *

Also Published As

Publication number Publication date
CN109918667A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
CN111563164B (en) Specific target emotion classification method based on graph neural network
CN109933670B (en) Text classification method for calculating semantic distance based on combined matrix
CN103955489B (en) Based on the Massive short documents of Information Entropy Features weight quantization this distributed KNN sorting algorithms and system
CN111125358B (en) Text classification method based on hypergraph
CN111985247B (en) Microblog user interest identification method and system based on multi-granularity text feature representation
CN108108351A (en) A kind of text sentiment classification method based on deep learning built-up pattern
CN107229610A (en) The analysis method and device of a kind of affection data
KR101561464B1 (en) Collected data sentiment analysis method and apparatus
CN110427616B (en) Text emotion analysis method based on deep learning
CN108985617B (en) Product production flow scheduling method and system based on intelligent manufacturing
CN106886580A (en) A kind of picture feeling polarities analysis method based on deep learning
CN110990718B (en) Social network model building module of company image lifting system
CN111078876A (en) Short text classification method and system based on multi-model integration
Rhodes Author attribution with cnns
CN110968692B (en) Text classification method and system
CN107679228A (en) A kind of short text data stream sorting technique based on short text extension and concept drift detection
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN113901211A (en) Multi-granularity Hanyue parallel sentence pair extraction method based on graph attention network
Hamim et al. Student profile modeling using boosting algorithms
CN110019796A (en) A kind of user version information analysis method and device
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
CN109918667B (en) Quick incremental classification method for short text data stream based on word2vec model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant