CN109918667B - Quick incremental classification method for short text data stream based on word2vec model

Info

Publication number
CN109918667B (application CN201910169255.3A)
Authority
CN
China
Prior art keywords: text, word, time, vector, vec
Legal status
Active
Application number
CN201910169255.3A
Other languages
Chinese (zh)
Other versions
CN109918667A
Inventor
李培培
胡阳
胡学钢
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Application filed by Hefei University of Technology
Priority to CN201910169255.3A
Publication of CN109918667A
Application granted
Publication of CN109918667B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed fast incremental classification method for short text data streams based on a word2vec model, which comprises the following steps: 1, acquiring an external corpus from a knowledge base, constructing a word2Vec model and acquiring a word vector set Vec; 2, constructing a dynamic linear LR classifier group and a global class label set; 3, updating the global class label set in an incremental mode, and modifying and extending the classifier group; 4, predicting samples, acquiring the set of sample texts containing new words that are not in the word set Vocab, and constructing a word vector model and a word vector expansion set; and 5, acquiring the class label strength of the test data stream. The method can improve the classification accuracy of short text data streams and reduce the time consumed by model construction, thereby achieving the purpose of rapidly adapting to short text data stream classification.

Description

Quick incremental classification method for short text data stream based on word2vec model
Technical Field
The invention belongs to the field of short text data stream mining, and particularly relates to a method for the fast classification of unbounded short text data streams whose class labels change continuously.
Background
With the rapid development of information technologies such as Web 2.0, massive, high-speed and dynamic data, i.e. data streams, emerge in practical application fields such as social networks, online shopping and sensor networks. In the social field, owing to the popularization of social network media and forums, short texts are rushing into our lives, such as user comments and forum interactions on microblogs and Twitter. These short texts contain a large amount of information from various fields, such as sports, education and science. Compared with ordinary text, short text is sparse, real-time, massive and non-standard. For example, a Sina microblog post is limited to 140 characters, and many posts contain only one sentence or even a single phrase. Short texts are also updated in real time and vary dynamically on the network, and topic evolution occurs, such as changes of the hot-topic ranking list and hot words on a microblog. Moreover, users usually generate a large amount of short text while interacting on network platforms, and the data volume grows rapidly. According to incomplete statistics, the user comments of current mainstream interaction platforms (such as microblog and Facebook) average 346 comments per second. This requires a short text processor to have a commensurate data processing throughput; otherwise, data will accumulate heavily over time. The above problems make the related short text classification methods and data stream classification methods face serious challenges:
Challenge one: conventional short text classification methods typically expand the short text with an external corpus and then classify it with a conventional classifier, such as naive Bayes, a Support Vector Machine (SVM) or a decision tree, or directly expand the short text with its own implicit statistical information before classification, such as LDA + KNN. However, these are essentially batch procedures and have difficulty processing short text data that keeps arriving over time.
Challenge two: a short text data stream generates a large amount of short text data at high speed at every moment, which causes a serious high-dimensionality problem, and existing text classification methods rarely address it.
Challenge three: because of their real-time variability, short text streams often undergo topic evolution, and current data stream classification work pays little attention to this situation.
Challenge four: most existing short text data stream classification work applies batch classifiers (such as SVM, naive Bayes and KNN) to short text data stream classification. These base classifiers are essentially designed for batch processing and obtain an optimal solution by iteration, so the training time cost is high, delays easily accumulate when processing data blocks, and they are difficult to apply to training on high-speed short text data streams and difficult to scale.
Disclosure of Invention
In order to avoid the defects in the prior art, the invention provides a quick incremental classification method of short text data streams based on a word2vec model, so that the classification accuracy of the short text data streams can be improved, the time consumption of model construction is reduced, and the purpose of quickly adapting to the classification of the short text data streams is achieved.
In order to achieve the aim, the invention adopts the following technical scheme:
the invention relates to a fast incremental classification method of short text data streams based on a word2vec model, which is characterized by comprising the following steps:
step 1: constructing a word2vec model:
Step 1.1: obtain an external text corpus from a knowledge base, denoted C_1 = {d_1, d_2, ..., d_m, ..., d_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C_1 and d_m represents the m-th text, with d_m = {w_1^m, ..., w_i^m, ..., w_I^m}, i = 1, 2, ..., I, where I denotes the number of words in the m-th text d_m, w_i^m denotes the i-th word of the m-th text d_m, and w_i^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C_1, and Vocab = {w_1, w_2, ..., w_z, ..., w_Z}, z = 1, 2, ..., Z, where Z represents the total number of words in the word set Vocab and w_z represents the z-th word in the word set Vocab; the word vector of the z-th word w_z is denoted Vec(w_z), giving the word vector set Vec = {Vec(w_1), Vec(w_2), ..., Vec(w_z), ..., Vec(w_Z)}, which is randomly initialized;
Step 1.2: count the number of occurrences, i.e. the word frequency, of every word of the word set Vocab in the external corpus C_1; delete the words whose frequency is less than or equal to the lower frequency limit F_1 or greater than or equal to the upper frequency limit F_2, obtaining a filtered word frequency set; sort the filtered word frequency set in descending order, and construct a Huffman tree T over the sorted word frequency set;
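The frequency filtering and Huffman coding of step 1.2 can be sketched as follows; this is only an illustrative Python sketch, and the function name, node layout and exact threshold handling are assumptions rather than details fixed by the patent:

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_codes(corpus, f_low, f_high):
    """Filter words by frequency and assign Huffman codes (lists of 0/1 digits)."""
    freq = Counter(w for text in corpus for w in text)
    # Keep only words whose frequency lies strictly between F_1 and F_2 (step 1.2)
    kept = {w: f for w, f in freq.items() if f_low < f < f_high}
    tie = count()  # tie-breaker so heapq never compares the node dicts
    heap = [(f, next(tie), {"word": w}) for w, f in kept.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), {"left": left, "right": right}))
    codes = {}
    def walk(node, code):
        if "word" in node:
            codes[node["word"]] = code
        else:
            walk(node["left"], code + [0])
            walk(node["right"], code + [1])
    if heap:
        walk(heap[0][2], [])
    return codes
```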
Step 1.3: set the window size to window, traverse the m-th text d_m of the external corpus C_1, and obtain the current i-th word w_i^m and its context Context(w_i^m); the conditional probability p(Context(w_i^m) | w_i^m) of predicting the context Context(w_i^m) when the current word w_i^m is known is calculated by equation (1):

p(Context(w_i^m) | w_i^m) = ∏_{u ∈ Context(w_i^m)} p(u | w_i^m)   (1)
In the formula (1), u represents a context
Figure BDA00019873797500000213
In any word, in>
Figure BDA00019873797500000210
Indicating that the current word is known to be pick>
Figure BDA00019873797500000211
Then, the conditional probability of the word u is predicted, and the following are:
Figure BDA00019873797500000212
in the formula (2), H is the height of Huffman tree T, d h The h-th leaf node in the path from the root node to the leaf node corresponding to the word u corresponds to the h-th digit in the Huffman code; theta h A parameter corresponding to the h-th leaf node in a path from the root node to the leaf node corresponding to the word u;
The probability p(d_h | Vec(w_i^m), θ_h) that the word vector Vec(w_i^m) of the i-th word w_i^m takes the h-th digit d_h under the parameter θ_h is calculated using equation (3):

p(d_h | Vec(w_i^m), θ_h) = [σ(Vec(w_i^m)·θ_h)]^(1-d_h) · [1 - σ(Vec(w_i^m)·θ_h)]^(d_h)   (3)

In equation (3), σ(·) denotes the sigmoid activation function;
Step 1.4: calculate the maximum likelihood function ζ of the probability p(Context(w_i^m) | w_i^m) using equation (4):

ζ = ∑_{d_m ∈ C_1} ∑_{w_i^m ∈ d_m} log p(Context(w_i^m) | w_i^m)   (4)
Step 1.5: for maximum likelihood function
Figure BDA0001987379750000038
Parameter θ of h And word vector pick>
Figure BDA0001987379750000039
Respectively obtaining corresponding gradients for updating the parameter theta h And word vector pick>
Figure BDA00019873797500000310
To obtain an updated parameter theta' j Sum word vector>
Figure BDA00019873797500000311
Further obtaining an updated word vector set Vec 'as a word2Vec model and assigning the updated word vector set Vec' to the original word vector set Vec;
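Steps 1.1 to 1.5 amount to training a word2vec model with hierarchical softmax over the Huffman tree T, predicting the context from the current word. The patent trains this model itself; purely as an illustration, a roughly equivalent model can be obtained with the gensim library (gensim is an assumption here, it is not named in the patent), using the feature dimension of 50 mentioned in the embodiment:

```python
from gensim.models import Word2Vec

# corpus_c1: the external corpus C_1 as a list of tokenized texts (illustrative values)
corpus_c1 = [["nba", "odds", "betting"], ["mariah", "carey", "concert"]]

model = Word2Vec(
    sentences=corpus_c1,
    vector_size=50,   # feature dimension used in the embodiment (Table 1)
    window=5,         # context window size "window"
    min_count=1,      # lower frequency limit F_1 (illustrative value)
    sg=1,             # skip-gram: predict the context from the current word
    hs=1,             # hierarchical softmax over the Huffman tree T
    negative=0,
)
vec = {w: model.wv[w] for w in model.wv.index_to_key}   # word vector set Vec
```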
step 2: constructing a dynamic linear LR classifier Model and a global class label set L:
step 2.1: defining a time t, and initializing t =0;
Step 2.2: acquire the data block D_t = {(d_n^t, y_n^t) | n = 1, 2, ..., N} at time t in the data stream, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text of the data block D_t at time t, and y_n^t is the class label of the n-th text of the data block D_t at time t; let L_t denote the class label set of the data block D_t at time t, with L_t = {l_1^t, l_2^t, ..., l_j^t, ..., l_J^t}, where l_j^t is the j-th class label of the class label set L_t at time t, j = 1, 2, ..., J, and J is the total number of different class labels in the data block D_t at time t;
Step 2.3: put the class label set L_t into the global class label set L; then traverse the n-th text d_n^t, obtain the word vector of each of its words, compute their average value, and take the average value as the text vector x_n^t of the n-th text d_n^t at time t; thereby obtain the text vectors of all texts in the data block D_t at time t and replace each text in the data block D_t at time t with its text vector, obtaining the replaced data block D_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} at time t;
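A minimal sketch of the text vectorization of step 2.3, averaging the word vectors of a text (function and variable names are illustrative):

```python
import numpy as np

def text_vector(text, vec, dim=50):
    """Average the word vectors of a text; words without a vector are skipped."""
    vectors = [vec[w] for w in text if w in vec]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# x_n_t = text_vector(["nba", "odds", "betting"], vec)
```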
Step 2.4: class label set L according to time t t The replaced data block D at the time t t Polymerizing to obtain a data block at the t time after polymerization
Figure BDA0001987379750000041
Wherein,
Figure BDA0001987379750000042
for aggregated data block D at time t t Wherein all class tags are->
Figure BDA0001987379750000043
A set of text vectors of;
Step 2.5: create and randomly initialize the j-th parameter vectors NP_j^t and ZP_j^t at time t, thereby obtaining the parameter vector sets NP_t = {NP_1^t, ..., NP_j^t, ..., NP_J^t} and ZP_t = {ZP_1^t, ..., ZP_j^t, ..., ZP_J^t} at time t;
Step 2.6: initializing j =1;
Step 2.7: in the aggregated data block D_t at time t, assign the class label 1 to the text vector set D_j^t whose class label is l_j^t and take it as the positive training samples; assign the class label 0 to the text vector sets whose class label is not l_j^t and take them as the negative training samples; construct the training text vector set Train_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} from the positive and negative training samples, where y_n^t ∈ {0, 1};
Step 2.8: when t >0, perform step 2.10, otherwise perform step 2.9;
Step 2.9: randomly initialize the feature weight w_j^t of the classifier whose class label is l_j^t;
Step 2.10: traversal training textVector set Train t And the jth parameter vector at the t moment is calculated according to the equations (5) and (6)
Figure BDA00019873797500000415
And &>
Figure BDA00019873797500000416
Updated to the jth parameter vector at time t +1>
Figure BDA00019873797500000417
And &>
Figure BDA00019873797500000418
Figure BDA00019873797500000419
Figure BDA00019873797500000420
In the formulae (5) and (6),
Figure BDA00019873797500000421
set ≥ for text vector>
Figure BDA00019873797500000422
Nth text vector at middle t moment
Figure BDA00019873797500000423
Is optimized for the iteration parameter factor of->
Figure BDA00019873797500000424
Set ≥ for text vector>
Figure BDA00019873797500000425
The nth text vector at time instant t->
Figure BDA00019873797500000426
Is based on the loss function>
Figure BDA0001987379750000051
Weight the characteristic->
Figure BDA0001987379750000052
And has:
Figure BDA0001987379750000053
Figure BDA0001987379750000054
Figure BDA0001987379750000055
in the formula (8), the reaction mixture is,
Figure BDA0001987379750000056
for the text vector at time t->
Figure BDA0001987379750000057
Is predicted probability of, and
Figure BDA0001987379750000058
Step 2.11: assign j+1 to j and return to step 2.8 until j is greater than J, thereby obtaining the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 2.12: traversing t +1 moment parameter vector set NP t+1 And ZP t+1 And initializing j =1;
step 2.13: calculating the jth characteristic weight at the t +1 moment by using the formula (10)
Figure BDA0001987379750000059
Figure BDA00019873797500000510
In formula (10), λ 1 、λ 2 Alpha and beta are both hyper-parameters, zgn (·) is an upward rounding function;
Step 2.14: use the j-th feature weight w_j^{t+1} at time t+1 to construct a linear LR classifier M_j^{t+1}, and add the j-th classifier M_j^{t+1} at time t+1 to the dynamic linear LR classifier group Model_{t+1} at time t+1;
Step 2.15: assign j+1 to j and return to step 2.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, M_2^{t+1}, ..., M_J^{t+1}} at time t+1;
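Steps 2.10 to 2.14 follow the FTRL style of per-coordinate updates. The sketch below assumes that equations (5) to (10) take the standard FTRL-Proximal form with logistic loss; all names and the default hyper-parameter values are illustrative:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def ftrl_update(np_j, zp_j, w_j, x, y, alpha=0.01):
    """One per-coordinate FTRL step for a (text vector x, 0/1 label y) pair:
    NP accumulates squared gradients, ZP the adjusted gradient sum."""
    p = sigmoid(np.dot(w_j, x))                                # equation (9)
    g = (p - y) * x                                            # equation (8)
    sigma = (np.sqrt(np_j + g * g) - np.sqrt(np_j)) / alpha    # equation (7)
    zp_j = zp_j + g - sigma * w_j                              # equation (5)
    np_j = np_j + g * g                                        # equation (6)
    return np_j, zp_j

def feature_weight(np_j, zp_j, alpha=0.01, beta=0.01, lam1=0.01, lam2=0.01):
    """Equation (10): sparse FTRL closed-form weight."""
    w = np.zeros_like(zp_j)
    active = np.abs(zp_j) > lam1
    w[active] = -(zp_j[active] - np.sign(zp_j[active]) * lam1) / (
        (beta + np.sqrt(np_j[active])) / alpha + lam2)
    return w
```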
And step 3: incrementally updating the global label set L, modifying and adding the classifier Model at the t +1 moment t+1
Step 3.1: initializing t =1;
Step 3.2: obtain the replaced data block D_t at time t according to the processing method of steps 2.2 to 2.3;
Step 3.3: traverse the text vector x_n^t of the n-th text d_n^t of the data block D_t at time t, and obtain the probability p_{n,j}^t that the text vector x_n^t has the j-th class label l_j^t using equation (11), thereby obtaining the class label probability set Pro_n^t = {p_{n,1}^t, p_{n,2}^t, ..., p_{n,j}^t, ..., p_{n,J}^t} of the text vector x_n^t:

p_{n,j}^t = 1 / (1 + exp(-w_j^t · x_n^t))   (11)
Step 3.4: class label probability aggregation
Figure BDA0001987379750000063
Using the class label corresponding to the medium maximum probability as a text vector
Figure BDA0001987379750000064
Is marked as->
Figure BDA0001987379750000065
Thereby obtaining a data set D at the time t t The prediction class label set ≥ of all text vectors in>
Figure BDA0001987379750000066
Step 3.5: the classifier Model at time t is calculated by equation (12) t For the data block D at the current time t t Accuracy of prediction (Accuracy) t
Figure BDA0001987379750000067
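A short sketch of the per-label prediction of steps 3.3 to 3.5, using equation (11) for the probability and equation (12) for the block accuracy (names are illustrative):

```python
import numpy as np

def predict_block(block_vectors, weights, labels):
    """Score every text vector against every per-label LR classifier (eq. (11))
    and keep the label with the highest probability (step 3.4)."""
    preds, probs = [], []
    for x in block_vectors:
        p = {lab: 1.0 / (1.0 + np.exp(-np.dot(w, x))) for lab, w in zip(labels, weights)}
        best = max(p, key=p.get)
        preds.append(best)
        probs.append(p)
    return preds, probs

def accuracy(preds, true_labels):
    """Equation (12): fraction of texts in the block predicted correctly."""
    return sum(p == y for p, y in zip(preds, true_labels)) / len(true_labels)
```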
Step 3.6: according to the processing method in the step 2.4, the aggregated data block D at the time t is obtained t
Step 3.7: defining the total iteration number as Inter if Accuracy t Less than a threshold λ, inter = num will be initialized; num is the set drift weight; otherwise, initializing Inter =1;
step 3.8: defining the current iteration number as an inter, and initializing the inter =1;
Step 3.9: obtain the parameter vectors NP_{t+1,iter} and ZP_{t+1,iter} of the iter-th iteration at time t+1 according to the processing method of steps 2.6 to 2.11;
Step 3.10: assign NP_{t+1,iter} and ZP_{t+1,iter} to NP_t and ZP_t respectively;
Step 3.11: assign iter+1 to iter and return to step 3.9 until iter is greater than Iter, thereby obtaining the j-th parameter vectors NP_j^{t+1,Iter} and ZP_j^{t+1,Iter} of the Iter-th iteration at time t+1, which are assigned to the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 3.12: traversing the t +1 moment parameter vector set NP of the set t+1 And ZP t+1 And initializing j =1;
step 3.13: calculating the jth characteristic weight at the t +1 moment by using the formula (10)
Figure BDA00019873797500000610
Step 3.14: if the jth classifier at the moment of t +1
Figure BDA00019873797500000611
Will->
Figure BDA00019873797500000612
Is assigned to
Figure BDA00019873797500000613
Thereby updating the classifier Model at time t +1 t+1
If the jth classifier at the moment of t +1
Figure BDA0001987379750000071
Then the jth characteristic weight is used at the moment t + 1->
Figure BDA0001987379750000072
Construction of a Linear LR classifier->
Figure BDA0001987379750000073
Will then->
Figure BDA0001987379750000074
Classifier Model added at time t +1 t+1 Performing the following steps;
Step 3.15: assign j+1 to j and return to step 3.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, M_2^{t+1}, ..., M_J^{t+1}} at time t+1;
Step 3.16: assign t+1 to t and return to step 3.2 until t is greater than T, where T is the end time of the data stream;
Step 4: predict the test data stream, acquire the new-word sample set S of texts containing words that do not appear in the word set Vocab, and construct a word vector model C_e and a word vector expansion set Vec-expand;
Step 4.1: define the word vector expansion set Vec-expand and initialize it to the empty set, and initialize t = 0;
Step 4.2: obtain the test data block D'_t = {d'_{1,t}, d'_{2,t}, ..., d'_{n,t}, ..., d'_{N,t}} at time t in the test data stream, where d'_{n,t} is the n-th test text of the test data block D'_t at time t and w'_{i,n,t} is the i-th word of the n-th test text d'_{n,t} at time t;
Define the new-word text set S_t at time t and initialize it to the empty set; then traverse all test texts of the test data block D'_t at time t: if all words of the n-th test text d'_{n,t} at time t belong to the word set Vocab, do not add the n-th test text d'_{n,t} to the new-word text set S_t; otherwise add the n-th test text d'_{n,t} to the new-word text set S_t; thereby obtain the updated new-word text set S'_t at time t and the corresponding new-word set Vocab'_t;
Step 4.3: if the new word set at the time t is the text set S t If the total number of the test texts in the text set is greater than the threshold value, the updated new word text set S 'is processed by the processing method in the step 1' t Building a word2vec model and obtaining a new word vector set
Figure BDA00019873797500000712
Wherein Vec * (w′ y,t ) Representing a new word set Vocab 'at time t' t The y-th word w' y,t The word vector of (2); y =1,2.. Y, Y is a set of new words Vocab' t The total number of Chinese words; otherwise, executing step 5;
Step 4.4: traverse the new word vector set Vec*_t; if Vec*(w'_{y,t}) ∈ Vec-expand, assign Vec*(w'_{y,t}) to Vec*(w'_{y,t+1}); otherwise add Vec*(w'_{y,t}) to the word vector expansion set Vec-expand;
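A compact sketch of the new-word handling of steps 4.2 to 4.4; train_word2vec stands for any routine that applies the step 1 procedure (for example the gensim sketch above) and is an assumed helper, not part of the patent:

```python
def update_expansion(test_block, vocab, vec_expand, train_word2vec, min_texts=1):
    """Collect test texts containing out-of-vocabulary words, train a word2vec
    model on them when there are enough, and merge the resulting word vectors
    into the expansion set Vec-expand (overwriting existing entries)."""
    new_texts = [text for text in test_block
                 if any(w not in vocab for w in text)]
    if len(new_texts) > min_texts:
        new_vectors = train_word2vec(new_texts)
        for word, vector in new_vectors.items():
            vec_expand[word] = vector   # add the vector, or refresh it if present
    return vec_expand
```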
Step 5: obtain the class label strength of the test data block D'_t at time t:
Step 5.1: obtain the class label probability set Pro'_t and the predicted class label set Lab'_t = {lab'_{1,t}, ..., lab'_{n,t}, ..., lab'_{N,t}} of all test text vectors in the test data block D'_t at time t using the processing methods of steps 3.2 to 3.4, and combine them into {(lab'_{n,t}, pro'_{n,t}) | n = 1, 2, ..., N}, where pro'_{n,t} is the probability that the predicted class label of the n-th test text d'_{n,t} of the test data block D'_t at time t is lab'_{n,t};
And step 5.2: acquisition type label probability set Pro' t Fractional set of all class labels in
Figure BDA0001987379750000088
Wherein->
Figure BDA0001987379750000089
Is a class label probability set Pro' t A medium prediction class label is->
Figure BDA00019873797500000810
Test text number of (2) divided by class tag probability set Pro' t The ratio of the total test text number in the test text;
acquiring class label probability set Pro' t Maximum probability set max _ Pro of all predicted class tags in t ={max_pro 1 ,max_pro 2 ...max_pro j ...max_pro J Where max _ pro j Is a class label probability set Pro' t The middle class label is
Figure BDA00019873797500000811
The corresponding maximum probability;
Step 5.3: take Present_t and max_Pro_t as the class label strength topic_t of the test data block D'_t;
Step 5.4: assign t+1 to t and return to step 4.2 until t is greater than T', where T' is the end time of the test data stream, thereby obtaining the class label strength set {topic_1, topic_2, ..., topic_t, ..., topic_{T'}} of the test data stream.
Compared with the prior art, the invention has the following beneficial effects:
1. The method takes into account that short texts are short and carry little information: it expands the short text by building a neural-network word2Vec word vector model with the help of an external corpus, maps each word to a vector of a specified dimension, and obtains the semantic relatedness between words, which reduces the sparsity problem of short text and improves the accuracy of short text classification. At the same time, considering that new words appear while a short text data stream is processed, a word vector expansion library Vec-expand is designed and an incremental word2Vec model is built for the new words. Introducing the word vector expansion library reduces the dependence on the external corpus and improves the stability of the algorithm, solving the problem that text vectorization depends too heavily on the corpus.
2. The invention designs a classifier group in a distributed manner, which is easy to use and highly portable, and constructs a fast incremental linear LR classifier group by means of the FTRL algorithm, making it suitable for fast training on data streams and for problems such as imbalanced class label distribution.
3. The invention aims to capture how the class label strength of the data stream changes over time. The predicted class label and the maximum probability of each predicted data block are acquired to judge the change of topic strength in the unknown data stream; the accuracy of the global classifier group is acquired for each training data block, and if the accuracy is lower than a specified threshold the data block is trained repeatedly to improve the precision of the classifier.
4. The practical application fields of the invention are as follows: online short text processing, organizing huge amounts of unstructured short text data into standard text data; online modeling and automatic classification of high-speed short text data streams on the Internet; visualization of the topic distribution of online short text data streams; and public opinion monitoring, for obtaining the change curve of a topic's popularity among users over the whole platform, recommendation, and so on; the applications are wide.
Drawings
FIG. 1 is a block diagram of the short text data stream classification of the present invention;
FIG. 2 is a diagram illustrating extended word vector training in accordance with the present invention;
FIG. 3 is a diagram illustrating an updating process of the incremental classifier according to the present invention.
Detailed Description
In this embodiment, as shown in FIG. 1, a distributed fast incremental classification method for short text data streams based on a word2vec model is mainly used to solve the problem of classifying high-speed short text data streams whose class label distribution changes continuously. A Chinese data stream is first segmented into words; an English data stream directly proceeds with the following steps:
Step 1: construct the word2vec model as per FIG. 2:
Step 1.1: obtain an external text corpus from a knowledge base, denoted C_1 = {d_1, d_2, ..., d_m, ..., d_M}, m = 1, 2, ..., M, where M represents the total number of texts in the external corpus C_1 and d_m represents the m-th text, with d_m = {w_1^m, ..., w_i^m, ..., w_I^m}, i = 1, 2, ..., I, where I denotes the number of words in the m-th text d_m, w_i^m denotes the i-th word of the m-th text d_m, and w_i^m ∈ Vocab; Vocab represents the set of all different words in the external corpus C_1, and Vocab = {w_1, w_2, ..., w_z, ..., w_Z}, where Z represents the total number of words in the word set Vocab and w_z represents the z-th word in the word set Vocab; the word vector of the z-th word w_z is denoted Vec(w_z), giving the word vector set Vec = {Vec(w_1), ..., Vec(w_z), ..., Vec(w_Z)}, which is randomly initialized;
Step 1.2: here Vocab is the set of all words (i.e. the word list) of the external corpus, and Vec is an all-zero array of size |Vocab| multiplied by the feature dimension, i.e. the vector of the z-th word in Vocab is Vec(w_z) = Vec[z × feature dimension : (z+1) × feature dimension];
Step 1.3: count the word frequency of every word of the word set Vocab in the external corpus C_1; delete the words whose frequency is less than or equal to the lower frequency limit F_1 or greater than or equal to the upper frequency limit F_2, obtaining a filtered word frequency set; sort the filtered word frequency set in descending order, and construct a Huffman tree T over the sorted word frequency set;
Step 1.4: initialize m = 0;
Step 1.5: set the window size to window, traverse the m-th text d_m of the external corpus C_1, and obtain the current i-th word w_i^m and its context Context(w_i^m); the conditional probability p(Context(w_i^m) | w_i^m) of predicting the context Context(w_i^m) when the current word w_i^m is known is calculated by equation (1):

p(Context(w_i^m) | w_i^m) = ∏_{u ∈ Context(w_i^m)} p(u | w_i^m)   (1)

In equation (1), u represents any word in the context Context(w_i^m), and p(u | w_i^m) is the conditional probability of predicting the word u when the current word w_i^m is known, with:

p(u | w_i^m) = ∏_{h=1}^{H} p(d_h | Vec(w_i^m), θ_h)   (2)

In equation (2), H is the height of the Huffman tree T, d_h is the h-th digit of the Huffman code, corresponding to the h-th node in the path from the root node to the leaf node of the word u, and θ_h is the parameter corresponding to the h-th node in that path; the probability p(d_h | Vec(w_i^m), θ_h) that the word vector Vec(w_i^m) takes the h-th digit d_h under the parameter θ_h is calculated using equation (3):

p(d_h | Vec(w_i^m), θ_h) = [σ(Vec(w_i^m)·θ_h)]^(1-d_h) · [1 - σ(Vec(w_i^m)·θ_h)]^(d_h)   (3)

In equation (3), σ(·) denotes the sigmoid activation function;
Step 1.6: calculate the maximum likelihood function ζ of the probability p(Context(w_i^m) | w_i^m) using equation (4):

ζ = ∑_{d_m ∈ C_1} ∑_{w_i^m ∈ d_m} log p(Context(w_i^m) | w_i^m)   (4)

Step 1.7: take the partial derivatives of the maximum likelihood function ζ with respect to the parameter θ_h and the word vector Vec(w_i^m) to obtain the corresponding gradients, which are used to update the parameter θ_h and the word vector Vec(w_i^m), giving the updated parameter θ'_h and the updated word vector Vec'(w_i^m); the updated word vector set Vec' is thus obtained as the word2Vec model and assigned to the original word vector set Vec;
Step 1.8: assign m+1 to m and return to step 1.5 until m is greater than M, so as to obtain the final word vector set Vec.
In this embodiment, a word2vec model containing word vectors as shown in Table 1 is trained from the external corpus, where the feature dimension = 50 and the size of Vocab = 171678.
Table 1: word2vec word vectors
Step 2: construct the dynamic linear LR classifier group Model and the global class label set L according to FIG. 3:
Step 2.1: define a time t and initialize t = 0;
Step 2.2: acquire the data block D_t = {(d_n^t, y_n^t) | n = 1, 2, ..., N} at time t in the data stream, where N is the total number of texts in the data block D_t at time t, d_n^t is the n-th text of the data block D_t at time t, and y_n^t is the class label of the n-th text of the data block D_t at time t; let L_t denote the class label set of the data block D_t at time t, with L_t = {l_1^t, ..., l_j^t, ..., l_J^t}, where l_j^t is the j-th class label of the class label set L_t at time t, j = 1, 2, ..., J, and J is the total number of different class labels in the data block D_t at time t;
Step 2.3: put the class label set L_t into the global class label set L; then traverse the n-th text d_n^t, obtain the word vector of each of its words, compute their average value, and take the average value as the text vector x_n^t of the n-th text d_n^t at time t; thereby obtain the text vectors of all texts in the data block D_t at time t and replace each text in the data block D_t at time t with its text vector, obtaining the replaced data block D_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} at time t;
Step 2.4: aggregate the replaced data block D_t at time t according to the class label set L_t at time t, obtaining the aggregated data block D_t = {D_1^t, ..., D_j^t, ..., D_J^t} at time t, where D_j^t is the set of text vectors in the aggregated data block D_t at time t whose class label is l_j^t;
Step 2.5: create and randomly initialize the j-th parameter vectors NP_j^t and ZP_j^t at time t, thereby obtaining the parameter vector sets NP_t = {NP_1^t, ..., NP_J^t} and ZP_t = {ZP_1^t, ..., ZP_J^t} at time t;
Step 2.6: initialize j = 1;
Step 2.7: in the aggregated data block D_t at time t, assign the class label 1 to the text vector set D_j^t whose class label is l_j^t and take it as the positive training samples; assign the class label 0 to the text vector sets whose class label is not l_j^t and take them as the negative training samples; construct the training text vector set Train_t = {(x_n^t, y_n^t) | n = 1, 2, ..., N} from the positive and negative training samples, where y_n^t ∈ {0, 1};
Step 2.8: when t > 0, perform step 2.10, otherwise perform step 2.9;
Step 2.9: randomly initialize the feature weight w_j^t of the classifier whose class label is l_j^t;
Step 2.10: traverse the training text vector set Train_t, and update the j-th parameter vectors NP_j^t and ZP_j^t at time t to the j-th parameter vectors NP_j^{t+1} and ZP_j^{t+1} at time t+1 according to equations (5) and (6):

ZP_j^{t+1} = ZP_j^t + g_n^t - σ_n^t · w_j^t   (5)
NP_j^{t+1} = NP_j^t + (g_n^t)^2   (6)

In equations (5) and (6), σ_n^t is the optimization iteration parameter factor of the n-th text vector x_n^t at time t of the text vector set D_j^t, and g_n^t is the gradient of the loss function of the n-th text vector x_n^t at time t of the text vector set D_j^t with respect to the feature weight w_j^t, with:

σ_n^t = (1/α) · (sqrt(NP_j^t + (g_n^t)^2) - sqrt(NP_j^t))   (7)
g_n^t = (p_n^t - y_n^t) · x_n^t   (8)
p_n^t = 1 / (1 + exp(-w_j^t · x_n^t))   (9)

In equation (8), p_n^t is the predicted probability of the text vector x_n^t at time t, given by equation (9);
Step 2.11: assign j+1 to j and return to step 2.8 until j is greater than J, thereby obtaining the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 2.12: traverse the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1, and initialize j = 1;
Step 2.13: calculate the j-th feature weight w_j^{t+1} at time t+1 using equation (10):

w_j^{t+1} = 0, if |ZP_j^{t+1}| ≤ λ_1;
w_j^{t+1} = -((β + sqrt(NP_j^{t+1}))/α + λ_2)^(-1) · (ZP_j^{t+1} - sgn(ZP_j^{t+1})·λ_1), otherwise   (10)

In equation (10), λ_1, λ_2, α and β are all hyper-parameters, and sgn(·) is the sign function;
Step 2.14: use the j-th feature weight w_j^{t+1} at time t+1 to construct a linear LR classifier M_j^{t+1}, and add the j-th classifier M_j^{t+1} at time t+1 to the dynamic linear LR classifier group Model_{t+1} at time t+1;
Step 2.15: assign j+1 to j and return to step 2.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, M_2^{t+1}, ..., M_J^{t+1}} at time t+1;
Step 3: incrementally update the global class label set L, and modify and extend the classifier group Model_{t+1} at time t+1:
Step 3.1: initialize t = 1;
Step 3.2: obtain the replaced data block D_t at time t according to the processing method of steps 2.2 to 2.3;
Step 3.3: traverse the text vector x_n^t of the n-th text d_n^t of the data block D_t at time t, and obtain the probability p_{n,j}^t that the text vector x_n^t has the j-th class label l_j^t using equation (11), thereby obtaining the class label probability set Pro_n^t = {p_{n,1}^t, ..., p_{n,j}^t, ..., p_{n,J}^t} of the text vector x_n^t:

p_{n,j}^t = 1 / (1 + exp(-w_j^t · x_n^t))   (11)

Step 3.4: take the class label corresponding to the maximum probability in the class label probability set Pro_n^t as the predicted class label lab_n^t of the text vector x_n^t, thereby obtaining the predicted class label set Lab_t = {lab_1^t, ..., lab_n^t, ..., lab_N^t} of all text vectors in the data block D_t at time t;
Step 3.5: calculate the prediction accuracy Accuracy_t of the classifier group Model_t at time t on the data block D_t at the current time t by equation (12):

Accuracy_t = |{n : lab_n^t = y_n^t, n = 1, ..., N}| / N   (12)

Step 3.6: obtain the aggregated data block D_t at time t according to the processing method of step 2.4;
Step 3.7: define the total number of iterations as Iter; if Accuracy_t is less than the threshold λ, initialize Iter = num, where num is the set drift weight; otherwise initialize Iter = 1;
Step 3.8: define the current iteration number as iter, and initialize iter = 1;
Step 3.9: obtain the parameter vectors NP_{t+1,iter} and ZP_{t+1,iter} of the iter-th iteration at time t+1 according to the processing method of steps 2.6 to 2.11;
Step 3.10: assign NP_{t+1,iter} and ZP_{t+1,iter} to NP_t and ZP_t respectively;
Step 3.11: assign iter+1 to iter and return to step 3.9 until iter is greater than Iter, thereby obtaining the j-th parameter vectors NP_j^{t+1,Iter} and ZP_j^{t+1,Iter} of the Iter-th iteration at time t+1, which are assigned to the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1;
Step 3.12: traverse the parameter vector sets NP_{t+1} and ZP_{t+1} at time t+1, and initialize j = 1;
Step 3.13: calculate the j-th feature weight w_j^{t+1} at time t+1 using equation (10);
Step 3.14: if the j-th classifier M_j^{t+1} at time t+1 already exists in the classifier group Model_{t+1}, assign the j-th feature weight w_j^{t+1} at time t+1 to M_j^{t+1}, thereby updating the classifier group Model_{t+1} at time t+1; if the j-th classifier M_j^{t+1} at time t+1 does not exist in the classifier group Model_{t+1}, construct a linear LR classifier M_j^{t+1} with the j-th feature weight w_j^{t+1} at time t+1 and add M_j^{t+1} to the classifier group Model_{t+1} at time t+1;
Step 3.15: assign j+1 to j and return to step 3.13 until j is greater than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {M_1^{t+1}, ..., M_J^{t+1}} at time t+1;
Step 3.16: assign t+1 to t and return to step 3.2 until t is greater than T, where T is the end time of the data stream;
Step 4: predict the test data stream, acquire the new-word sample set S of texts containing words that do not appear in the word set Vocab, and construct a word vector model C_e and a word vector expansion set Vec-expand;
Step 4.1: define the word vector expansion set Vec-expand and initialize it to the empty set, and initialize t = 0;
Step 4.2: obtain the test data block D'_t = {d'_{1,t}, ..., d'_{n,t}, ..., d'_{N,t}} at time t in the test data stream, where d'_{n,t} is the n-th test text of the test data block D'_t at time t and w'_{i,n,t} is the i-th word of the n-th test text d'_{n,t} at time t; define the new-word text set S_t at time t and initialize it to the empty set; then traverse all test texts of the test data block D'_t at time t: if all words of the n-th test text d'_{n,t} at time t belong to the word set Vocab, do not add the n-th test text d'_{n,t} to the new-word text set S_t; otherwise add the n-th test text d'_{n,t} to the new-word text set S_t; thereby obtain the updated new-word text set S'_t at time t and the corresponding new-word set Vocab'_t;
Step 4.3: if the total number of test texts in the new-word text set S_t at time t is greater than a threshold, build a word2vec model on the updated new-word text set S'_t with the processing method of step 1 and obtain the new word vector set Vec*_t = {Vec*(w'_{1,t}), ..., Vec*(w'_{y,t}), ..., Vec*(w'_{Y,t})}, where Vec*(w'_{y,t}) is the word vector of the y-th word w'_{y,t} of the new-word set Vocab'_t at time t, y = 1, 2, ..., Y, and Y is the total number of words in the new-word set Vocab'_t; otherwise, execute step 5;
Step 4.4: traverse the new word vector set Vec*_t; if Vec*(w'_{y,t}) ∈ Vec-expand, assign Vec*(w'_{y,t}) to Vec*(w'_{y,t+1}); otherwise add Vec*(w'_{y,t}) to the word vector expansion set Vec-expand;
Step 5: obtain the class label strength of the test data block D'_t at time t:
Step 5.1: obtain the class label probability set Pro'_t and the predicted class label set Lab'_t = {lab'_{1,t}, ..., lab'_{n,t}, ..., lab'_{N,t}} of all test text vectors in the test data block D'_t at time t using the processing methods of steps 3.2 to 3.4, and combine them into {(lab'_{n,t}, pro'_{n,t}) | n = 1, 2, ..., N}, where pro'_{n,t} is the probability that the predicted class label of the n-th test text d'_{n,t} of the test data block D'_t at time t is lab'_{n,t};
Step 5.2: obtain the proportion set Present_t = {present_1, ..., present_j, ..., present_J} of all class labels in the class label probability set Pro'_t, where present_j is the number of test texts in the class label probability set Pro'_t whose predicted class label is l_j divided by the total number of test texts in Pro'_t; obtain the maximum probability set max_Pro_t = {max_pro_1, ..., max_pro_j, ..., max_pro_J} of all predicted class labels in the class label probability set Pro'_t, where max_pro_j is the maximum probability corresponding to the class label l_j in the class label probability set Pro'_t;
Step 5.3: take Present_t and max_Pro_t as the class label strength topic_t of the test data block D'_t;
Step 5.4: assign t+1 to t and return to step 4.2 until t is greater than T', where T' is the end time of the test data stream, thereby obtaining the class label strength set {topic_1, topic_2, ..., topic_t, ..., topic_{T'}} of the test data stream.
Example: according to the above classification method for short text data streams, as shown in FIG. 1, the method specifically comprises the following steps:
(1) Obtain the data block D_t at time t of the data stream, where the data are as shown in Table 2;
Table 2: 15 sample data of the short text data stream
(2) For the first training data block D_1, initialize L = { } and Model = { };
(3) Obtain the class label set L_1 = {Health, Sci_tech, Entertainment} of D_1, and add L_1 to the global variable L, i.e. L = {Health, Sci_tech, Entertainment};
(4) Convert each text into a text vector based on the word vector set Vec (see Table 1), i.e. use the word vector model to convert D_1 into Table 3;
Table 3: vectorized data blocks
(5) Aggregate D_1 according to its class labels;
(6) Traverse the aggregated groups and initialize NP, ZP and the corresponding feature weights;
(7) Traverse the sets NP_0 and ZP_0. For NP_1^0 and ZP_1^0, randomly initialize the feature weight, take the text vector group of the corresponding class label as the positive training set and the remaining groups as the negative training set, and perform an FTRL update; after the traversal, write the new NP_1^0 and ZP_1^0 back into NP and ZP. Repeat the same procedure for the remaining class labels, updating NP and ZP;
(8) Traverse NP and ZP from (7), calculate the weights with formula (10), where λ_1 = 0.01, λ_2 = 0.01, α = 0.01 and β = 0.01, and construct a linear classifier with each weight, thereby obtaining model_1, model_2 and model_3, which are added to the global classifier group Model;
(9) For the other data blocks D_t, e.g. D_2, vectorize D_2 according to the processing method of step (4), and obtain the class label set L_2 = {Us, Sport, World, Health, Business} of D_2. Since L = {Health, Sci_tech, Entertainment} and Us, Sport, World, Business ∉ L, add Us, Sport, World and Business to L, i.e. L = {Health, Sci_tech, Entertainment, Us, Sport, World, Business};
(10) Traverse D_t, e.g. D_2: for each text vector, use Model = {M_1, M_2, M_3} to calculate the probability of each class label; the results are shown in Table 4.
Table 4: data block prediction results
Obtain the most likely class label of each text of D_2, Lab = {Entertainment, Sci_tech, Entertainment, Health, Entertainment}, and the class label probability set of D_2, pro = {0.5, 0.6, 0.4, 0.8, 0.6}; the calculated accuracy = 20%.
(11) Aggregate D_2 according to L;
(12) Defining an initial total iteration number Iter, since accuracy =20% <50%, i.e. initialization Iter =3;
(13) Obtain the NP and ZP sets of the previous data block, and update the NP and ZP parameter sets after iterating step (7) Iter times;
(14) Calculate the feature weight of the classifier corresponding to each class label using the parameter sets NP and ZP;
(15) After step (8), only the classifiers corresponding to the labels Health, Sci_tech and Entertainment are in the classifier group Model. Add the LR classifiers constructed from the feature weights corresponding to Us, Sport, World and Business to Model, and update the classifier weight corresponding to Health;
(16) Assigning t +1 to t, and returning to the step (9) until the data flow is finished;
(17) Define the new-word text set S and the word vector expansion library Vec-expand and initialize them to empty sets.
(18) For a data block D_t at time t of the test data stream, e.g. D_n, first traverse D_n; if a test text contains a word that does not belong to Vocab, put that text into S. If |S| > 0, use the processing method of step 1 to obtain the word2Vec model of S and put its word vectors into the Vec-expand expansion library; if Vec-expand already contains a vector for the current word, it is updated. Here Vec-expand = { }, so the word vectors of S are put into Vec-expand, i.e. Vec-expand = {Vec(mariah), Vec(carey), Vec(join), Vec(simon), Vec(cowells), Vec(factor)};
(19) Vectorize D_n using the processing method of step (4); using the processing method of step (10), obtain the predicted class label set Lab and the class label probability set Pro of D_n, where Lab = {Business, Us, Business, Entertainment, ...} and Pro = {0.9, 0.7, 0.9, 0.5, 0.7};
(20) From the predicted Pro, obtain the proportion present of each class label in D_t and the corresponding maximum probability set max_pro. For the above example D_n: present = {Business: 0.4, Us: ...}; max_pro = {Business: 0.9, Us: ...};
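As a quick check, the class_label_strength sketch from step 5 reproduces the values quoted in step (20); the fifth predicted label of D_n is not legible in the original text, so the value used below is only a placeholder:

```python
pred_labels = ["Business", "Us", "Business", "Entertainment", "World"]  # fifth label assumed
pred_probs = [0.9, 0.7, 0.9, 0.5, 0.7]
present, max_pro = class_label_strength(pred_labels, pred_probs)
# present["Business"] == 0.4 and max_pro["Business"] == 0.9, matching step (20)
```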
(21) And assigning t +1 to t, and returning to the step (18) until the test data flow is ended.

Claims (1)

1. A fast incremental classification method of short text data streams based on a word2vec model is characterized by comprising the following steps:
step 1: constructing a word2vec model:
step 1.1: obtaining an external corpus of text from a knowledge base, denoted C 1 ={d 1 ,d 2 ...d m ...d M M =1,2.. M, where M represents the text external corpus C 1 Total number of texts of d m Represents the m-th text and has
Figure FDA0001987379740000011
I denotes the mth text d m The number of the Chinese words is equal to the number of the Chinese words,
Figure FDA0001987379740000012
representing said mth text d m The ith word in, and
Figure FDA0001987379740000013
vocab represents an external corpus of text C 1 Word set of all different words in (1), and Vocab = { w = { (w) 1 ,w 2 ....w z ....w Z Z, Z representing the total number of words in the set of words Vocab, w z Represents the z-th word in the word set Vocab and leads the z-th word w z The word vector of (c) is denoted as Vec (w) z ) Thus, for the set of word vectors Vec = { Vec (w) 1 ),Vec(w 2 )…Vec(w z )…Vec(w Z ) Fifthly, randomly initializing the word vector set Vec;
step 1.2: counting all words in the word set Vocab in the text external corpus C 1 The number of occurrences in (1), i.e. word frequency, is less than or equal to the lower frequency limit F 1 Or greater than or equal to the upper frequency limit F 2 Deleting the word frequency to obtain a filtered word frequency set; sorting the filtered word frequency sets in a descending order, and constructing a Huffman tree T for the sorted word frequency sets;
step 1.3: setting the window size to window, and traversing the text external corpus C 1 M-th text d in m And obtains each word of the current wordThe ith word
Figure FDA0001987379740000014
In the context of
Figure FDA0001987379740000015
The known current word is calculated by equation (1) as
Figure FDA0001987379740000016
Predict its context
Figure FDA0001987379740000017
Conditional probability of (2)
Figure FDA0001987379740000018
Figure FDA0001987379740000019
In the formula (1), u represents a context
Figure FDA00019873797400000110
Any one of the words in any one of the above,
Figure FDA00019873797400000111
indicating that the current word is known as
Figure FDA00019873797400000112
Then, the conditional probability of the word u is predicted, and the following are provided:
Figure FDA00019873797400000113
in the formula (2), H is the height of Huffman tree T, d h Is that the h-th leaf node in the path from the root node to the leaf node corresponding to the word u corresponds to the Huffman codeThe h-th digit of (1); theta h A parameter corresponding to the h-th leaf node in a path from the root node to the leaf node corresponding to the word u;
calculating the ith word using equation (3)
Figure FDA0001987379740000021
Word vector of
Figure FDA0001987379740000022
At the parameter theta h By the presence of the h-th digit d h Probability of (2)
Figure FDA0001987379740000023
Figure FDA0001987379740000024
In formula (3), σ (·) represents an excitation function;
step 1.4: calculating probability using equation (4)
Figure FDA0001987379740000025
Maximum likelihood function of
Figure FDA0001987379740000026
Figure FDA0001987379740000027
Step 1.5: for maximum likelihood function
Figure FDA0001987379740000028
Parameter θ of h Sum word vector
Figure FDA0001987379740000029
Respectively calculating partial derivatives to obtain corresponding gradients for updating the parameter theta h Sum word vector
Figure FDA00019873797400000210
To obtain an updated parameter theta' j Sum word vector
Figure FDA00019873797400000211
Further obtaining an updated word vector set Vec 'as a word2Vec model and assigning the updated word vector set Vec' to the original word vector set Vec;
step 2: constructing a dynamic linear LR classifier Model and a global class label set L:
step 2.1: defining a time t, and initializing t =0;
step 2.2: acquiring a data block at time t in any data stream
Figure FDA00019873797400000212
Wherein N is a data block D at time t t The total number of texts in (a),
Figure FDA00019873797400000213
for a data block D at time t t The (n) th text of the (n),
Figure FDA00019873797400000214
for a data block D at time t t Class label of the nth text in order of L t Representing a block of data D at time t t And a class label set of (c) and having:
Figure FDA00019873797400000215
Figure FDA00019873797400000216
for time t class label set L t J =1,2.. J, which is a data block D at time t t The total number of different types of labels in the label list;
step 2.3 class Label set L t Put into the global class label set L and then traverse the nth text
Figure FDA00019873797400000217
Obtaining a word vector of each word, then obtaining an average value, and using the average value as the nth text at the time t
Figure FDA00019873797400000218
Text vector of
Figure FDA00019873797400000219
Thereby obtaining a data block D at the time t t The text vectors of all texts in the text table are replaced, and the data block D at the time t is replaced t Obtaining the replaced t-time data block from each text
Figure FDA00019873797400000220
Step 2.4: class label set L according to time t t The replaced data block D at the time t t Polymerizing to obtain a data block at the t time after polymerization
Figure FDA00019873797400000221
Wherein,
Figure FDA00019873797400000222
for aggregated data block D at time t t Wherein all classes are labeled as
Figure FDA00019873797400000223
A set of text vectors of;
step 2.5: creating and randomly initializing the jth parameter vector at time t
Figure FDA0001987379740000031
And
Figure FDA0001987379740000032
thereby obtaining a parameter vector set at the time t
Figure FDA0001987379740000033
And
Figure FDA0001987379740000034
step 2.6: initializing j =1;
step 2.7: the aggregated data block D at the time t t Wherein all classes are labeled as
Figure FDA0001987379740000035
Set of text vectors
Figure FDA0001987379740000036
The value of class label of (1) is assigned as a positive training sample, and the rest class labels are not
Figure FDA0001987379740000037
Assigning a class label corresponding to the text vector set to 0, and using the class label as a negative training sample;
constructing a training text vector set by the positive training sample and the negative training sample
Figure FDA0001987379740000038
Wherein
Figure FDA0001987379740000039
Step 2.8: when t >0, perform step 2.10, otherwise perform step 2.9;
step 2.9: random initialization class labels as
Figure FDA00019873797400000310
Characteristic weight of
Figure FDA00019873797400000311
Step 2.10: trainting training text vector set Train t And the jth parameter vector at the t moment is calculated according to the equations (5) and (6)
Figure FDA00019873797400000312
And
Figure FDA00019873797400000313
updating to the jth parameter vector at the moment of t +1
Figure FDA00019873797400000314
And
Figure FDA00019873797400000315
Figure FDA00019873797400000316
Figure FDA00019873797400000317
in the formulae (5) and (6),
Figure FDA00019873797400000318
for a set of text vectors
Figure FDA00019873797400000319
Nth text vector at middle t moment
Figure FDA00019873797400000320
The optimization of the iteration parameter factor of (2),
Figure FDA00019873797400000321
for a set of text vectors
Figure FDA00019873797400000322
Nth text vector at middle t moment
Figure FDA00019873797400000323
Loss function of
Figure FDA00019873797400000324
For feature weight
Figure FDA00019873797400000325
And has:
Figure FDA00019873797400000326
Figure FDA00019873797400000327
Figure FDA00019873797400000328
in the formula (8), the reaction mixture is,
Figure FDA0001987379740000041
for the text vector at the t moment
Figure FDA0001987379740000042
Is predicted probability of, and
Figure FDA0001987379740000043
step 2.11: j +1 is assigned to J, the step 2.8 is returned until J is larger than J, and therefore a parameter vector set NP at the moment of t +1 is obtained t+1 And ZP t+1
Step 2.12: traversing t +1 moment parameter vector set NP t+1 And ZP t+1 And initializing j =1;
step 2.13: calculating the jth characteristic weight at the t +1 moment by using the formula (10)
Figure FDA0001987379740000044
Figure FDA0001987379740000045
In formula (10), λ 1 、λ 2 Alpha and beta are both hyper-parameters, zgn (·) is an upward rounding function;
Step 2.14: constructing a linear LR classifier f_{t+1}^j using the jth feature weight W_{t+1}^j at time t+1, and adding the jth classifier f_{t+1}^j at time t+1 to the dynamic linear LR classifier group Model_{t+1} at time t+1;
Step 2.15: assigning j+1 to j, and returning to step 2.13 until j is larger than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {f_{t+1}^1, f_{t+1}^2, ..., f_{t+1}^j, ..., f_{t+1}^J} at time t+1;
Step 3: incrementally updating the global class label set L, and modifying and adding to the classifier group Model_{t+1} at time t+1:
Step 3.1: initializing t =1;
Step 3.2: obtaining the replaced data block D_t at time t according to the processing method of steps 2.2 to 2.3;
Step 3.3: traverse the data block D at time t t The nth text
Figure FDA00019873797400000410
Text vector of
Figure FDA00019873797400000411
And obtains a text vector using equation (11)
Figure FDA00019873797400000412
Class jth tag of
Figure FDA00019873797400000413
Probability of (2)
Figure FDA00019873797400000414
Thereby obtaining a text vector
Figure FDA00019873797400000415
Class label probability set of
Figure FDA00019873797400000416
Figure FDA00019873797400000417
Step 3.4: class label probability aggregation
Figure FDA00019873797400000418
Using the class label corresponding to the medium maximum probability as a text vector
Figure FDA00019873797400000419
Is marked as
Figure FDA00019873797400000420
Thereby obtaining a data set D at the time t t Prediction class label set of all text vectors in
Figure FDA00019873797400000421
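A Python sketch of the prediction in steps 3.3 and 3.4, assuming formula (11) is the logistic (sigmoid) score of each one-vs-rest LR classifier (this form is an assumption; names are illustrative):

import numpy as np

def predict_label(text_vector: np.ndarray, weights_per_label: dict) -> tuple:
    """Score a text vector with every one-vs-rest LR classifier and return
    the label with the highest probability together with the full
    class-label probability set (steps 3.3 and 3.4)."""
    probs = {label: 1.0 / (1.0 + np.exp(-np.dot(w, text_vector)))
             for label, w in weights_per_label.items()}
    best_label = max(probs, key=probs.get)
    return best_label, probs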
Step 3.5: the classifier Model at time t is calculated by equation (12) t In pair whenData block D at time t t Accuracy of prediction of (Accuracy) t
Figure FDA0001987379740000051
Step 3.6: according to the processing method in the step 2.4, the aggregated data block D at the time t is obtained t
Step 3.7: defining the total iteration number as Inter if Accuracy t Less than a threshold λ, inter = num will be initialized; num is the set drift weight; otherwise, initializing Inter =1;
step 3.8: defining the current iteration number as an inter, and initializing the inter =1;
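A small sketch of the drift-sensitive control flow in steps 3.7 and 3.8: when the block accuracy drops below the threshold, the model is retrained for more iterations, otherwise a single incremental pass is used (the numeric values in the usage comment are placeholders, not values from the patent):

def choose_iterations(accuracy_t: float, lam: float, num: int) -> int:
    """Steps 3.7 and 3.8: a low block accuracy is treated as a sign of
    concept drift and triggers `num` retraining iterations; otherwise
    one incremental pass is performed."""
    return num if accuracy_t < lam else 1

# usage sketch:
#   Iter = choose_iterations(accuracy_t, lam=..., num=...)
#   for inter in range(1, Iter + 1):
#       update NP_{t+1,inter} and ZP_{t+1,inter} as in steps 2.6 to 2.11,
#       then feed them back into NP_t and ZP_t (step 3.10)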
Step 3.9: obtaining the parameter vectors NP_{t+1,inter} and ZP_{t+1,inter} of the inter-th iteration at time t+1 according to the processing method of steps 2.6 to 2.11;
Step 3.10: assigning NP_{t+1,inter} and ZP_{t+1,inter} to NP_t and ZP_t, respectively;
Step 3.11: assigning Inter +1 to the Inter, and returning to the step 3.9 to execute until the Inter is greater than the Iter, thereby obtaining the jth parameter vector of the Inter iteration at the moment of t +1
Figure FDA0001987379740000052
And
Figure FDA0001987379740000053
and assigning to a parameter vector set NP at the moment t +1 t+1 And ZP t+1
Step 3.12: traversing the t +1 moment parameter vector set NP of the set t+1 And ZP t+1 And initializing j =1;
Step 3.13: calculating the jth feature weight W_{t+1}^j at time t+1 by using formula (10);
Step 3.14: if the jth classifier at the moment of t +1
Figure FDA0001987379740000055
Then will be
Figure FDA0001987379740000056
Is assigned to
Figure FDA0001987379740000057
Thereby updating the classifier Model at time t +1 t+1
If the jth classifier at the moment of t +1
Figure FDA0001987379740000058
Then the jth feature weight at time t +1 is used
Figure FDA0001987379740000059
Construction of Linear LR classifier
Figure FDA00019873797400000510
Then will be
Figure FDA00019873797400000511
Classifier Model added at time t +1 t+1 Performing the following steps;
Step 3.15: assigning j+1 to j, and returning to step 3.13 until j is larger than J, thereby obtaining the dynamic linear LR classifier group Model_{t+1} = {f_{t+1}^1, ..., f_{t+1}^j, ..., f_{t+1}^J} at time t+1;
Step 3.16: assigning T +1 to T, and returning to the step 3.2 until T is greater than T, wherein T is the ending moment of the data stream;
Step 4: predicting the test data stream, acquiring the new-word text set S whose words do not appear in the word set Vocab, and constructing a word vector model C_e and a word vector expansion set Vec-expand:
Step 4.1: defining the word vector expansion set Vec-expand, initializing it to an empty set, and initializing t = 0;
Step 4.2: obtaining the test data block D'_t at time t in the test data stream, wherein d'_{n,t} is the nth test text in the test data block D'_t at time t, and w'_{i,n,t} is the ith word in the nth test text d'_{n,t} at time t;
defining the new-word text set S_t at time t and initializing it to an empty set, and then traversing all test texts in the test data block D'_t at time t: if all the words in the nth test text d'_{n,t} at time t belong to the word set Vocab, the nth test text d'_{n,t} is not added to the new-word text set S_t; otherwise, the nth test text d'_{n,t} is added to the new-word text set S_t; thereby obtaining the updated new-word text set S'_t at time t and the corresponding new word set Vocab'_t;
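A Python sketch of the new-word screening in step 4.2: a test text enters the new-word text set as soon as it contains at least one word outside the known word set Vocab (names are illustrative):

def split_new_word_texts(test_texts, vocab):
    """Step 4.2: split a test data block into texts that contain
    out-of-vocabulary words (candidates for S_t) and fully covered texts."""
    new_word_texts, covered_texts = [], []
    for text in test_texts:                      # each text is a list of words
        if all(word in vocab for word in text):
            covered_texts.append(text)
        else:
            new_word_texts.append(text)
    return new_word_texts, covered_texts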
Step 4.3: if the total number of test texts in the new-word text set S_t at time t is greater than the threshold, building a word2vec model on the updated new-word text set S'_t according to the processing method of step 1, and obtaining the new-word vector set Vec* = {Vec*(w'_{1,t}), ..., Vec*(w'_{y,t}), ..., Vec*(w'_{Y,t})}, wherein Vec*(w'_{y,t}) represents the word vector of the yth word w'_{y,t} in the new word set Vocab'_t at time t, y = 1, 2, ..., Y, and Y is the total number of words in the new word set Vocab'_t; otherwise, executing step 5;
step 4.4: go through
Figure FDA0001987379740000069
If Vec * (w′ y,t ) E is the Vec-expand, then the Vec is added * (w′ y,t ) Assign to Vec * (w′ y,t+1 ) Otherwise Vec will be * (w′ y,t ) Adding the word vector expansion set Vec-expanded;
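A sketch of steps 4.3 and 4.4 using the gensim library (version 4 or later is assumed; the training hyper-parameters shown are illustrative defaults, not values from the patent):

from gensim.models import Word2Vec

def extend_word_vectors(new_word_texts, vec_expand: dict,
                        vector_size: int = 100) -> dict:
    """Train a word2vec model on the new-word texts S'_t and merge the
    resulting vectors into the expansion set Vec-expand, keeping the
    existing vector when a word is already present (steps 4.3 and 4.4)."""
    model = Word2Vec(sentences=new_word_texts, vector_size=vector_size,
                     window=5, min_count=1, sg=1)   # skip-gram, keep rare words
    for word in model.wv.index_to_key:
        if word not in vec_expand:                  # already-known vectors are kept
            vec_expand[word] = model.wv[word]
    return vec_expand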
Step 5: obtaining the class label strength of the test data block D'_t at time t:
Step 5.1: obtaining the class label probability set Pro'_t and the predicted class label set of all test text vectors in the test data block D'_t at time t by using the processing methods of steps 3.2 to 3.4, and combining them into Pro'_t = {pro'_{1,t}, ..., pro'_{n,t}, ..., pro'_{N,t}}, wherein pro'_{n,t} represents the probability with which the nth test text d'_{n,t} in the test data block D'_t at time t is assigned its predicted class label l'_{n,t};
Step 5.2: acquiring class label probability set Pro' t Fractional set of all class labels in
Figure FDA00019873797400000617
Wherein
Figure FDA00019873797400000618
Is a class label probability set Pro' t The medium prediction class label is
Figure FDA0001987379740000071
Test text number of (2) divided by class tag probability set Pro' t The ratio of the total test text number in the test text;
acquiring class label probability set Pro' t Maximum probability set max _ Pro of all predicted class tags in t ={max_pro 1 ,max_pro 2 ...max_pro j ...max_pro J Where max _ pro j Is a class label probability set Pro' t The middle class label is
Figure FDA0001987379740000072
The corresponding maximum probability;
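A Python sketch of the class label strength computation in step 5.2, assuming the predictions of one test block are available as (predicted label, probability) pairs (names are illustrative):

from collections import Counter

def class_label_strength(predictions):
    """Step 5.2: compute present_t (the share of test texts assigned to each
    label) and max_Pro_t (the highest probability observed for each label)."""
    total = len(predictions)
    counts, max_prob = Counter(), {}
    for label, prob in predictions:
        counts[label] += 1
        max_prob[label] = max(prob, max_prob.get(label, 0.0))
    present = {label: counts[label] / total for label in counts}
    return present, max_prob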
Step 5.3: taking present_t and max_Pro_t as the class label strength topic_t of the test data block D'_t;
Step 5.4: assigning T +1 to T, and returning to the step 4.2 until T is greater than T ', wherein T' is the end time of the test data stream; thereby obtaining the label-like strength set { topic) of the test data stream 1 ,topic 2 ...topic t ,...topic T′ }。
CN201910169255.3A 2019-03-06 2019-03-06 Quick incremental classification method for short text data stream based on word2vec model Active CN109918667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910169255.3A CN109918667B (en) 2019-03-06 2019-03-06 Quick incremental classification method for short text data stream based on word2vec model

Publications (2)

Publication Number Publication Date
CN109918667A CN109918667A (en) 2019-06-21
CN109918667B true CN109918667B (en) 2023-03-24

Family

ID=66963460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910169255.3A Active CN109918667B (en) 2019-03-06 2019-03-06 Quick incremental classification method for short text data stream based on word2vec model

Country Status (1)

Country Link
CN (1) CN109918667B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026846B (en) * 2019-12-09 2021-08-17 Hefei University of Technology Online short text data stream classification method based on feature extension

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824922A (en) * 2016-03-16 2016-08-03 Chongqing University of Posts and Telecommunications Emotion classifying method fusing intrinsic feature and shallow feature
CN106776534A (en) * 2016-11-11 2017-05-31 Beijing Technology and Business University The incremental learning method of term vector model
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof
CN107679228A (en) * 2017-10-23 2018-02-09 Hefei University of Technology A kind of short text data stream sorting technique based on short text extension and concept drift detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chinese short text classification method based on the word2vec word model; Gao Mingxia et al.; Journal of Shandong University (Engineering Science); 2018-11-02 (No. 02); full text *

Also Published As

Publication number Publication date
CN109918667A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
CN111563164B (en) Specific target emotion classification method based on graph neural network
CN109933670B (en) Text classification method for calculating semantic distance based on combined matrix
CN103955489B (en) Based on the Massive short documents of Information Entropy Features weight quantization this distributed KNN sorting algorithms and system
CN111125358B (en) Text classification method based on hypergraph
CN111985247B (en) Microblog user interest identification method and system based on multi-granularity text feature representation
CN108108351A (en) A kind of text sentiment classification method based on deep learning built-up pattern
CN107229610A (en) The analysis method and device of a kind of affection data
KR101561464B1 (en) Collected data sentiment analysis method and apparatus
CN110427616B (en) Text emotion analysis method based on deep learning
CN108985617B (en) Product production flow scheduling method and system based on intelligent manufacturing
CN106886580A (en) A kind of picture feeling polarities analysis method based on deep learning
CN110990718B (en) Social network model building module of company image lifting system
CN111078876A (en) Short text classification method and system based on multi-model integration
Rhodes Author attribution with cnns
CN110968692B (en) Text classification method and system
CN107679228A (en) A kind of short text data stream sorting technique based on short text extension and concept drift detection
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN113901211A (en) Multi-granularity Hanyue parallel sentence pair extraction method based on graph attention network
Hamim et al. Student profile modeling using boosting algorithms
CN110019796A (en) A kind of user version information analysis method and device
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
CN109918667B (en) Quick incremental classification method for short text data stream based on word2vec model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant