CN109933670A - Text classification method for computing semantic distance based on a combined matrix - Google Patents

Text classification method for computing semantic distance based on a combined matrix

Info

Publication number
CN109933670A
CN109933670A
Authority
CN
China
Prior art keywords
text
vector
word
matrix
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910209354.XA
Other languages
Chinese (zh)
Other versions
CN109933670B (en
Inventor
裘嵘
杨俊杰
张祖平
罗律
Current Assignee
Central South University
Original Assignee
Central South University
Priority date
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN201910209354.XA priority Critical patent/CN109933670B/en
Publication of CN109933670A publication Critical patent/CN109933670A/en
Application granted granted Critical
Publication of CN109933670B publication Critical patent/CN109933670B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method that computes semantic distance based on a combined matrix, comprising the steps of: S1, processing Chinese text to generate a vector space model based on a bag of words; S2, for the full text set, using the generated bag-of-words text vectors as the training corpus, training word vectors with word2vec, and combining the trained word vectors with the text vectors to form a text matrix; S3, performing a cross operation on the text matrices to obtain the semantic distance between texts. The text-vector representation and semantic-distance calculation proposed by the invention both overcome the defects of the traditional bag-of-words model and remedy the shortcomings of the TF-IDF algorithm, so that a better classification model can be trained and the accuracy of text classification improved.

Description

Text classification method for computing semantic distance based on a combined matrix
Technical field
The invention belongs to the field of natural language processing, and more particularly relates to a text classification method for computing semantic distance based on a combined matrix.
Background art
With the rapid development of the Internet and the gradual spread of communication technology, how to efficiently organize and manage network data that grows exponentially has become a difficult and urgent research topic. In this vast sea of documents, a large share of the data takes text as its form of expression. The process of automatically classifying these texts is called text classification, a technique combining pattern recognition with natural language processing that aims to associate a target text with one or more categories according to the attributes or features of its content. Traditional text classification methods were based on knowledge engineering: experts in the relevant field manually extracted classification rules from their experience with the target texts, and these rules served as the basis for classification. In recent years, with advances in machine learning and deep learning technologies and in computing speed, text classification methods based on statistical machine learning have come to be favored and have achieved significant advantages in the accuracy and stability of classification results.
The current industrial process of text classification based on statistical machine learning usually has several important steps: first, digitally model the natural language text, i.e., express the real text in a data form that a computer can process efficiently; second, convert all target documents into this specific data representation according to the modeling method; third, define the operational relations between the data representations of different documents; fourth, using the data representation of text documents and the relations between different representations, design and train a machine learning model for text classification; fifth, for a given document of unknown category, convert it into the specific data representation, feed it to the trained machine learning model, and obtain the predicted category of the document.
In natural language processing, the digital modeling of text typically uses the vector space model (VSM): a text document is expressed as a vector in an n-dimensional vector space, where each distinct position in the vector represents a feature item, and the numerical value at each position indicates the weight, i.e., the importance, of that position within the whole vector. The sum of the feature items constitutes the overall representation of the text. In Chinese, the word is the smallest unit that expresses meaning in the language, so in Chinese natural language processing the words in a text are usually chosen as the feature items of the text vector, and the weight of each word's position in a particular text vector indicates the importance of that word in the document.
In practice, the bag-of-words model (BoW) is generally used as the initial form of the vector space model. The bag-of-words model places all the words of all texts in fixed positions of arbitrary order; because each distinct word serves as a unique feature item, the fixed word universe constitutes a vector space capable of expressing any text. For a specific text, the weight of each feature item in its text vector is the frequency with which the corresponding word occurs in that text, i.e., the number of occurrences of a word in a particular text measures its importance in that text. Besides representing texts, the vector space model can also represent words themselves as vectors in the n-dimensional space: for a specific word in the word universe, its word vector is expressed with one-hot coding, where the value at the word's fixed position in the vector space is 1 and the positions representing all other words are 0. A specific text's bag-of-words text vector can therefore also be viewed as the sum of the word vectors of all the words constituting that text.
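The bag-of-words and one-hot constructions described above can be sketched as follows. This is an illustrative example only, not part of the patent's disclosure; all function names and the toy corpus are chosen for illustration.

```python
# Minimal sketch of a bag-of-words text vector and one-hot word vectors
# over a fixed, shared vocabulary, as described above.
from collections import Counter

def build_vocab(docs):
    """Fixed, ordered word universe shared by all texts."""
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def bow_vector(doc, vocab):
    """Term-frequency vector: each position is a word, the value its count."""
    vec = [0] * len(vocab)
    for word, freq in Counter(doc).items():
        vec[vocab[word]] = freq
    return vec

def one_hot(word, vocab):
    """One-hot word vector: 1 at the word's fixed position, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

docs = [["machine", "learning", "text"], ["text", "classification", "text"]]
vocab = build_vocab(docs)
print(bow_vector(docs[1], vocab))  # counts at fixed vocabulary positions
```

Note that summing the one-hot vectors of a document's words reproduces its bag-of-words vector, matching the observation in the text above.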
In the bag-of-words model, the value at each position of the text vector is determined solely by the frequency of the corresponding word in the text, which is a significant limitation. Besides representing the text-vector weights by word frequency, the method most commonly used in engineering applications to compute text-vector weights is the term frequency-inverse document frequency (TF-IDF) algorithm. Its idea is that the weight of a feature item is proportional to the frequency with which the item occurs in the document and inversely proportional to the number of documents in the whole corpus that contain it: for a specific document, the more often a word occurs in it, the higher its importance; but the more often the same word occurs across other documents, the more generic it is, and the lower its weight.
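The TF-IDF idea described above can be sketched in a few lines. This is an illustrative example; TF-IDF variants differ in log base and smoothing, and the specific form below is an assumption, not the patent's exact formula.

```python
# Hedged sketch of TF-IDF weighting: weight grows with in-document
# frequency and shrinks with corpus-wide document frequency.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document freq.
    return tf * idf

corpus = [["cat", "sits"], ["cat", "cat", "runs"], ["dog", "runs"]]
# "dog" appears in only 1 of 3 documents, "runs" in 2, so for the third
# document "dog" receives a higher weight than the more generic "runs".
```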
The similarity between any two documents D_x and D_y can be measured by some distance relation between their vectors D_x and D_y in the n-dimensional space. A common method computes the inner product between the vectors, which can be used to characterize the angle, i.e., the cosine similarity, between the two vectors.
However, in practice there are tens of thousands of common Chinese words, while a single document generally contains only hundreds or thousands, so the generated vector space has tens of thousands of dimensions while a vector has only a few thousand nonzero entries. The resulting document vectors and word vectors are all high-dimensional sparse vectors. The similarity obtained from the inner product of two such sparse vectors is not only imprecise, but the model and its calculation also completely ignore the relations between the feature items of the text vector, i.e., the semantic distance between words.
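The limitation described above can be made concrete with a small sketch (illustrative only; the two-word documents are toy data): under the bag-of-words model, two documents that share no surface words score zero cosine similarity no matter how close their meanings are.

```python
# Inner-product / cosine similarity between two document vectors,
# illustrating why sparse high-dimensional vectors give coarse results:
# words that never co-occur contribute exactly zero, regardless of meaning.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# "car" and "automobile" occupy different vocabulary positions, so these
# two semantically close documents score 0 under the bag-of-words model.
d1 = [1, 0]  # contains "car"
d2 = [0, 1]  # contains "automobile"
print(cosine(d1, d2))  # → 0.0
```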
In recent years, the development of deep learning theory and technology has had a deep effect on natural language processing. Particularly important is a distributed-representation technique known as word embedding. Distributed representation is based on the distributional hypothesis and obtains the semantic representation of words from a co-occurrence matrix, while word embedding maps each word into a new space, representing it as a multidimensional continuous real-valued vector. Among word embedding methods, the most notable is the word2vec model proposed by Google, which trains a language model with an artificial neural network algorithm and, in the course of training, obtains a low-dimensional vector for each word. Representing word vectors in a low-dimensional space not only solves the dimensionality-disaster problem but also uncovers the relational properties between words, improving the accuracy of word vectors in semantic expression.
Summary of the invention
The method proposed by the present invention for representing texts and computing the semantic distance between texts differs from the traditional strategies above.
In the text-representation modeling method, suppose the word universe used to express all texts contains n words, so that the vector space of the text language model has dimension n, and let m be the prescribed dimension of the word vectors to be trained. For a specific text document, the TF-IDF algorithm is first used to compute the weight of each word in the document, yielding an n-dimensional weight vector for the text. The text is not expressed directly by this weight vector; instead, the value of each feature item in the weight vector, i.e., the word's weight, is multiplied by the word vector at that feature item's position, thereby embedding the m-dimensional word vector of each word into the word's weight. The resulting new m-dimensional weighted word vectors replace the original weight values at the corresponding positions of each word, finally producing an n × m matrix representation of the text. Each row of the text matrix is a row vector whose value equals the word vector scalar-multiplied by the word's weight. Such a row vector carries not only the weight of a word in the text but also the syntactic and semantic features of the word. By this means the vector space representation of text is extended from a traditional vector to a matrix, making its semantic expressiveness and the amount of information it carries much richer.
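The n × m text-matrix construction described above can be sketched as follows. The weights and embeddings are toy values, not trained ones, and the function name is chosen for illustration.

```python
# Sketch of the text matrix: row i is word i's TF-IDF weight times its
# m-dimensional word vector (here n = 2 words, m = 3 dimensions).
import numpy as np

def text_matrix(weights, embeddings):
    """weights: (n,) TF-IDF vector; embeddings: (n, m) word vectors."""
    return weights[:, None] * embeddings  # scale row i by weight i

weights = np.array([0.5, 2.0])
embeddings = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])
M = text_matrix(weights, embeddings)
print(M.shape)  # (2, 3)
```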
Unlike the traditional way of measuring text semantic distance by computing the cosine similarity of text vectors, the invention proposes a text semantic-distance calculation method based on the text-representation modeling above. For two different texts whose data representation is an n × m matrix, each row vector of document matrix 1 computes a similarity not only with the row vector at the corresponding position of document matrix 2 but with every row vector of document matrix 2, and the results are accumulated as the semantic similarity measure of the two documents. Because word vectors encode the semantic distance and grammatical relations between words, this method can more accurately compute the semantic distance and co-occurrence probability between the different feature items of different texts.
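The cross-matrix accumulation described above can be sketched as follows. This is an assumption faithful to the description rather than a verbatim reproduction of the patent's formula; the matrices are toy data.

```python
# Every row of one text matrix is compared with every row of the other,
# and the dot products are accumulated into a single similarity score.
import numpy as np

def semantic_distance(A, B):
    """A, B: (n, m) text matrices of two documents."""
    total = 0.0
    for row_a in A:      # each weighted word vector of document 1 ...
        for row_b in B:  # ... against each weighted word vector of document 2
            total += float(np.dot(row_a, row_b))
    return total

A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 1.0], [2.0, 0.0]])
print(semantic_distance(A, B))  # sum of all pairwise dot products
```

The double loop is equivalent to summing all elements of the matrix product A·Bᵀ, which is the form used in the distance formula later in the description.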
As for the choice of machine learning classification model: the field addressed by the invention is text classification in natural language processing, and texts can fall into many categories under different classification criteria, so binary classifiers are of limited use and multi-class classifiers are usually employed in this task. Among multi-class classifiers, distance-based classification algorithms fit the present application scenario better than common algorithms such as support vector machines or multinomial logistic regression, because the most important basis of the classification task is the semantic distance between texts, and the former have lower algorithmic complexity and computational cost, which speeds up model training and prediction.
The present invention uses the KNN classification algorithm, combined with the iterated class-centroid idea of the K-Means clustering algorithm. The idea of KNN is to find, for an unknown sample, the K known samples closest to it in feature space (i.e., nearest in language distance in the language model); if the majority of these K samples belong to some particular category, the unknown sample is predicted to belong to that category as well. KNN must compute the distance between the unknown sample and every sample in the feature space, so if the feature samples are the full training set, the computation for a single classification task is very large. The invention therefore builds the feature space by "choosing representatives": according to the proportion of each category's texts in the full text set, a weighted number of "representative" samples is chosen from each category as that category's part of the feature space. For the choice of "representative" features, the centroid (cluster-center) iteration of the K-Means clustering method is borrowed: for each category in the training samples, several samples are randomly drawn from the category and their centroid computed, repeated many times; the number of centroids computed from random samples of a category depends on the number of samples in that category; finally, the set of centroids computed for all categories serves as the sample space of the classification model.
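The "choosing representatives" step described above can be sketched as follows. The subset size, centroid counts, and sample data are illustrative assumptions, not the patent's exact parameters.

```python
# For each class, repeatedly draw a random subset of its samples and keep
# the subset's centroid; the number of centroids kept per class would be
# proportional to the class's share of the full text set.
import random
import numpy as np

def class_centroids(samples, n_centroids, subset_size=3, seed=0):
    rng = random.Random(seed)
    centroids = []
    for _ in range(n_centroids):
        subset = rng.sample(samples, min(subset_size, len(samples)))
        centroids.append(np.mean(subset, axis=0))  # centroid of this draw
    return centroids

class_a = [np.array([1.0, 1.0]), np.array([1.2, 0.8]), np.array([0.9, 1.1])]
cents = class_centroids(class_a, n_centroids=2)
print(len(cents))  # 2 representative centroids for this class
```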
For a text document of unknown category, to use the model it is first converted, by the method above, into a text matrix; the text matrix is then placed into the classification model to obtain the predicted category of the unknown text.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1. It overcomes the defect of the traditional bag-of-words model that, for a specific text, the words in the bag are unordered, so the contextual relations between words in the document cannot be considered;
2. It remedies the shortcoming of the TF-IDF algorithm of measuring the semantic distance between documents only from the computed term weights and the cosine similarity between weight vectors;
3. Although the computational complexity of this way of computing inter-text semantic distance is higher, it yields more accurate similarity measures between texts: it not only brings more similar texts closer together in the vector space but also pushes more dissimilar texts farther apart, i.e., it gives texts of different categories clearer classification boundaries. The improvement brought by this effect also carries into the training of the subsequent machine learning classification model: the training data set has clearer class discrimination, so a better classification model can be trained and the accuracy of text classification improved.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the technical flow chart of the text classification performed by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention fall within the scope protected by the invention.
As shown in Fig. 1, the technical flow by which the invention classifies texts specifically includes: performing word segmentation on the Chinese text and filtering stop words; counting word frequencies and generating the bag-of-words representation of the text, i.e., the original text vector; updating each feature weight in the bag-of-words text vector with the TF-IDF algorithm; for the full text set, using the bag-of-words text vectors generated in the above steps as the training corpus and training word vectors with word2vec; combining the trained word vectors with the text vectors to form text matrices; performing the cross operation on the text matrices to obtain the semantic distance between texts; training a distance-based text classifier; converting a text of unknown category into a text matrix, inputting the text matrix to the classification model, and obtaining the predicted category of the text. The specific implementation is as follows:
For the text classification task, assume a labeled text data set D_X = {D_x1, D_x2, …, D_xk} with corresponding class labels L_X = {L_x1, L_x2, …, L_xk}, where D_xi is a specific text document in the data set and L_xi is its category.
1. Vector space representation of the document
Since in the vector space model of Chinese text the smallest feature-item unit is the Chinese word, for a candidate document D_xi of continuous text it is first necessary to perform automatic word segmentation with a Chinese word segmentation algorithm, filter out words without substantive meaning such as auxiliary words and modal particles, and finally split the text into a string of consecutive words.
The consecutive words are then counted, and the bag-of-words model of the text [(t_1, f_1), (t_2, f_2), …, (t_n, f_n)] is generated from the statistics. Each entry of the bag of words is a 2-tuple whose first element t_i is the word itself and whose second element f_i is its frequency in the text. It is worth noting that throughout the processing of all text documents, the word position order (t_1, t_2, …, t_n) of the bag of words is fixed and unchanging.
In the bag of words of each text, the weight of word t_i is its occurrence frequency f_i in that text. To assess more accurately the importance of a particular word t_i in a specific document D_xi, the weight w_i of each word must be updated according to the TF-IDF calculation formula:

w_i = K(t_i, D_xi) = tf(t_i, D_xi) × idf(t_i)

K(t_i, D_xi) is the new computed weight, where tf(t_i, D_xi) is the occurrence frequency of word t_i in document D_xi, and idf(t_i) is the inverse of the frequency with which word t_i occurs in the whole document set.
From the above process, the vector space model representation of any text document D_xi can be obtained:

D_xi = [(t_1, w_1), (t_2, w_2), …, (t_n, w_n)]

Since the word position order of the bag of words is fixed, each word has a fixed index position in the model vector, so the text vector can be expressed as:

D_xi = [w_1, w_2, …, w_n]
At this point the vector expression of the text has been obtained, converting the natural language text into a text vector amenable to mathematical modeling and computation.
2. Training the word vectors
Word vectors can be trained by building a recurrent neural network with Google's TensorFlow deep learning framework, or with Google's open framework word2vec. Behind the word2vec algorithm is a shallow artificial neural network that can quickly and conveniently train the required word-vector model; it can train efficiently on dictionaries of millions of words and data sets of hundreds of millions of items, and at a later stage new corpora can be added to a previous model for incremental training, optimizing model quality.
The bag of words produced in the above steps can be used directly as the corpus for word-vector training. This both makes the semantic information carried by the trained word vectors fit the actual task more closely and guarantees that the words used in the document set can all be represented by the word-vector model.
For the trained word-vector model, the word vector of word t_i takes the form:

v(t_i) = [v_1, v_2, …, v_m]

where the vector dimension m of the word vectors is a value set at training time, usually chosen between 100 and 1000.
3. Matrix model representation of the document
Based on the document vector model and the word-vector model above, a document can be expressed with the word-vector-based TF-IDF weight combination. For any text document D_xi = [w_1, w_2, …, w_n], the weight w_i of each feature item t_i in the vector is scalar-multiplied by the word vector v(t_i) of that feature item to obtain a new vector w_i·v(t_i). Replacing the original scalar weight with this new vector extends the text vector into a text matrix of size n × m:

D_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]
Thus, for a specific text document, the text-matrix form not only expresses the importance of each word in the text but also retains the word's grammatical (i.e., contextual) and semantic information, raising the volume and quality of the information carried by the text to a new dimension and level.
4. Computing the semantic distance between text documents
The text matrix above can also be expressed in the following vector form:

D_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]

where w_j·v(t_j) denotes, for the word t_j at the j-th position of document D_xi, the vector obtained by scalar-multiplying the word's weight w_j by its word vector v(t_j).
For two candidate documents D_x and D_y, their semantic distance can be computed with the following formula:

dist(D_x, D_y) = sum(D_x · D_y^T)

where sum() accumulates all elements of the matrix. Expanding D_x over its n rows means computing, for each j-th position of D_x, the sum of the dot products of its vector with the vectors at all positions of D_y. The semantic distance formula for the two texts finally computed is therefore:

dist(D_x, D_y) = Σ_{j=1..n} Σ_{k=1..n} (w_j^x·v(t_j^x)) · (w_k^y·v(t_k^y))
From the calculation process it can be seen that this method considers not only the weight of each word in the text but also the co-occurrence probability, semantic relevance, and contextual relations between all feature items of the two documents, as reflected by the word-vector operations on words at different positions. It breaks the inherent limitations of traditional statistical calculation methods and achieves a comprehensive assessment of document semantic distance from statistics through to grammar, computing inter-document relevance more accurately and reliably.
5. Training the classifier
For the fully labeled training text set D_X = {D_x1, D_x2, …, D_xk}, the number of text data items of each category is counted from the label data L_X = {L_x1, L_x2, …, L_xk}. Using stratified sampling, a weighted number of "representative" samples is chosen from each category according to the proportion of the category's texts in the full text set, as that category's part of the feature space. For each category in the training samples, several samples are randomly drawn from the category and their centroid computed, repeated many times; the number of centroids computed from random samples of a category depends on the number of samples in that category. Finally, the set of centroids computed for all categories serves as the sample space of the classification model.
6. Predicting the category of an unknown text
For an unknown sample, steps 1, 2, and 3 above are first applied to convert the text document into a document matrix, generating the text matrix of the method above, which is placed into the classification model. According to the inter-text semantic-distance calculation of step 4, the semantic distances between the target document and all category centroids in the feature space are computed to find the K known samples closest to it (i.e., nearest in language distance in the language model). If the majority of these K samples belong to some particular category, the unknown sample is predicted to belong to that category as well.
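The prediction step above can be sketched as follows. For brevity, a simple one-dimensional stand-in distance is used here instead of the cross-matrix semantic distance of step 4; the centroid values, labels, and K are illustrative assumptions.

```python
# Score the unknown sample against every class centroid, then vote
# among the K nearest centroids.
from collections import Counter

def knn_predict(sample, centroids, labels, k=3):
    """centroids: list of points; labels: class of each centroid."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: abs(centroids[i] - sample))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

centroids = [0.1, 0.2, 0.9, 1.0, 1.1]
labels = ["sports", "sports", "finance", "finance", "finance"]
print(knn_predict(0.95, centroids, labels))  # → finance
```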
The above are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the invention shall be included within the protection scope of the invention.

Claims (9)

1. A text classification method for computing semantic distance based on a combined matrix, characterized by comprising the steps of:
S1, processing Chinese text to generate a vector space model based on a bag of words;
S2, for the full text set, using the generated bag-of-words text vectors as the training corpus, training word vectors with word2vec, and combining the trained word vectors with the text vectors to form a text matrix;
S3, performing a cross operation on the text matrices to obtain the semantic distance between texts.
2. The text classification method for computing semantic distance based on a combined matrix according to claim 1, characterized in that step S1 specifically comprises:
S1.1, performing automatic word segmentation on the Chinese text with a Chinese word segmentation algorithm while filtering out words without substantive meaning, splitting the text into a string of consecutive words;
S1.2, counting word frequencies and generating the bag-of-words representation of the text, i.e., the original text vector;
S1.3, updating each feature weight in the bag-of-words text vector with the TF-IDF algorithm to obtain the vector expression of the text.
3. The text classification method for computing semantic distance based on a combined matrix according to claim 2, characterized in that the bag-of-words model of the text [(t_1, f_1), (t_2, f_2), …, (t_n, f_n)] is generated from the statistics; each entry of the bag of words is a 2-tuple whose first element t_i is the word itself and whose second element f_i is its frequency in the text; and throughout the processing of all text documents, the word position order (t_1, t_2, …, t_n) of the bag of words is fixed;
to assess more accurately the importance of a particular word t_i in a specific document D_xi, the weight w_i of each word is updated according to the TF-IDF calculation formula:
w_i = K(t_i, D_xi) = tf(t_i, D_xi) × idf(t_i)
K(t_i, D_xi) is the new computed weight, where tf(t_i, D_xi) is the number of occurrences of word t_i in document D_xi, and idf(t_i) is the inverse of the frequency with which word t_i occurs in the whole document set;
therefore, the vector space model representation of any text document D_xi can be obtained:
D_xi = [(t_1, w_1), (t_2, w_2), …, (t_n, w_n)]
which simplifies to
D_xi = [w_1, w_2, …, w_n].
4. The text classification method for computing semantic distance based on a combined matrix according to claim 3, characterized in that step S2 specifically comprises: for any text document D_xi = [w_1, w_2, …, w_n], scalar-multiplying the weight w_i of each feature item t_i in the vector by the word vector v(t_i) of that feature item to obtain a new vector w_i·v(t_i); replacing the original scalar weight with this new vector extends the text vector into a text matrix of size n × m, which simplifies to
D_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]
where w_j·v(t_j), with t_j ∈ D_xi, denotes, for the word t_j at the j-th position of document D_xi, the vector obtained by scalar-multiplying the word's weight w_j by its word vector v(t_j).
5. The text classification method for computing semantic distance based on a combined matrix according to claim 4, characterized in that in step S3, for two candidate documents D_x and D_y, their semantic distance is computed with the following formula:
dist(D_x, D_y) = sum(D_x · D_y^T)
where sum() accumulates all elements of the matrix; expanding D_x over its n rows means computing, for each j-th position of D_x, the sum of the dot products of its vector with the vectors at all positions of D_y; the semantic distance formula for the two texts finally computed is therefore:
dist(D_x, D_y) = Σ_{j=1..n} Σ_{k=1..n} (w_j^x·v(t_j^x)) · (w_k^y·v(t_k^y)).
6. A text classification method, characterized by comprising the steps of:
S1, processing Chinese text to generate a vector space model based on a bag of words;
S2, for the full text set, using the generated bag-of-words text vectors as the training corpus, training word vectors with word2vec, and combining the trained word vectors with the text vectors to form a text matrix;
S3, performing a cross operation on the text matrices to obtain the semantic distance between texts;
S4, training a distance-based text classifier;
S5, converting a text of unknown category into a text matrix, inputting the text matrix to the classification model, and obtaining the predicted category of the text.
7. The text classification method according to claim 6, characterized in that step S1 specifically comprises:
S1.1, performing automatic word segmentation on the Chinese text with a Chinese word segmentation algorithm while filtering out meaningless words, dividing the text into a sequence of consecutive words;
S1.2, counting word frequencies to generate the bag-of-words representation of the text, i.e. the original text vector;
S1.3, updating the weight of each feature item in the bag-of-words text vector with the TF-IDF algorithm to obtain the text vector representation.
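Steps S1.2 and S1.3 can be sketched in a few lines of Python. This is a hedged toy version with our own function names and a plain log-based IDF, not necessarily the exact weighting variant the patent uses:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (the output of word segmentation, S1.1).
    Returns the vocabulary and one TF-IDF-weighted vector per document."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # document frequency of each term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        # weight = (term frequency) * log(N / document frequency)
        vectors.append([(tf[t] / len(d)) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors
```

A term present in every document gets IDF log(1) = 0, so ubiquitous words contribute nothing — the behaviour the claim relies on when it replaces raw counts with TF-IDF weights.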
8. The text classification method according to claim 6, characterized in that step S2 specifically comprises:
for any text document D_xi = [w_1, w_2, …, w_n], scalar-multiplying the weight w_i of each feature item t_i in the vector by the word vector v(t_i) of that feature item to obtain a new vector w_i·v(t_i); replacing the original scalar weight with this new vector extends the text vector into a text matrix of size n × m:
M_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]^T
where t_j ∈ D_xi, and w_j·v(t_j) denotes the vector obtained by scalar-multiplying the weight w_j of the word t_j at the j-th position of document D_xi with that word's word vector v(t_j).
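Assuming toy embeddings in place of trained word2vec vectors, the matrix construction of this claim can be sketched as follows (function and variable names are ours):

```python
def text_matrix(weights, embeddings):
    """weights: TF-IDF weights [w1, ..., wn] of the document's n terms.
    embeddings: the m-dimensional word vector v(ti) of each term, in order.
    Returns the n x m matrix whose j-th row is wj * v(tj)."""
    return [[w * component for component in vec]
            for w, vec in zip(weights, embeddings)]
```

Each scalar weight of the original text vector is thus replaced by a full m-dimensional row, turning the 1 × n vector into an n × m matrix.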
9. The text classification method according to claim 6, characterized in that in step S3, for two candidate documents D_x and D_y, the semantic distance is calculated as:
Dist(D_x, D_y) = sum(M_x · M_y^T)
where sum() accumulates all elements of the matrix and M_x, M_y are the text matrices of D_x and D_y. Extending D_x to n dimensions computes, for any j-th position of D_x, the sum of the dot products between its vector and all position vectors of D_y, Σ_k (w_j·v(t_j)) · (w'_k·v(t'_k)). The semantic distance between the two texts is therefore:
Dist(D_x, D_y) = Σ_j Σ_k (w_j·v(t_j)) · (w'_k·v(t'_k))
CN201910209354.XA 2019-03-19 2019-03-19 Text classification method for calculating semantic distance based on combined matrix Active CN109933670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910209354.XA CN109933670B (en) 2019-03-19 2019-03-19 Text classification method for calculating semantic distance based on combined matrix

Publications (2)

Publication Number Publication Date
CN109933670A true CN109933670A (en) 2019-06-25
CN109933670B CN109933670B (en) 2021-06-04

Family

ID=66987629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910209354.XA Active CN109933670B (en) 2019-03-19 2019-03-19 Text classification method for calculating semantic distance based on combined matrix

Country Status (1)

Country Link
CN (1) CN109933670B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558489A (en) * 2018-12-03 2019-04-02 南京中孚信息技术有限公司 File classification method and device
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110389932A (en) * 2019-07-02 2019-10-29 华北电力科学研究院有限责任公司 Electric power automatic document classifying method and device
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110738049A (en) * 2019-10-12 2020-01-31 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110909162A (en) * 2019-11-15 2020-03-24 龙马智芯(珠海横琴)科技有限公司 Text quality inspection method, storage medium and electronic equipment
CN111104508A (en) * 2019-10-25 2020-05-05 重庆邮电大学 Method, system and medium for representing word bag model text based on fault-tolerant rough set
CN111125328A (en) * 2019-12-12 2020-05-08 深圳数联天下智能科技有限公司 Text processing method and related equipment
CN111368552A (en) * 2020-02-26 2020-07-03 北京市公安局 Network user group division method and device for specific field
CN112417893A (en) * 2020-12-16 2021-02-26 江苏徐工工程机械研究院有限公司 Software function demand classification method and system based on semantic hierarchical clustering
CN112818679A (en) * 2019-11-15 2021-05-18 阿里巴巴集团控股有限公司 Event type determination method and device and electronic equipment
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN114492420A (en) * 2022-04-02 2022-05-13 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100094875A1 (en) * 2008-08-11 2010-04-15 Collective Media, Inc. Method and system for classifying text
CN105426923A (en) * 2015-12-14 2016-03-23 北京科技大学 Semi-supervised classification method and system
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature
CN107590177A (en) * 2017-07-31 2018-01-16 南京邮电大学 A kind of Chinese Text Categorization of combination supervised learning
US20180114142A1 (en) * 2016-10-26 2018-04-26 Swiss Reinsurance Company Ltd. Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof
CN108255813A (en) * 2018-01-23 2018-07-06 重庆邮电大学 A kind of text matching technique based on term frequency-inverse document and CRF
CN108897769A (en) * 2018-05-29 2018-11-27 武汉大学 Network implementations text classification data set extension method is fought based on production
US20190034766A1 (en) * 2016-04-21 2019-01-31 Sas Institute Inc. Machine learning predictive labeling system
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN SONG et al.: "Research of Chinese Text Classification Methods Based on Semantic Vector and Semantic Similarity", IEEE *
ZHANG Jingyi et al.: "Research on a Text Classification Model Based on Word Vector Features", Information Technology and Standardization *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558489A (en) * 2018-12-03 2019-04-02 南京中孚信息技术有限公司 File classification method and device
CN110348497B (en) * 2019-06-28 2021-09-10 西安理工大学 Text representation method constructed based on WT-GloVe word vector
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110389932A (en) * 2019-07-02 2019-10-29 华北电力科学研究院有限责任公司 Electric power automatic document classifying method and device
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110738049A (en) * 2019-10-12 2020-01-31 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110738049B (en) * 2019-10-12 2023-04-18 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN111104508A (en) * 2019-10-25 2020-05-05 重庆邮电大学 Method, system and medium for representing word bag model text based on fault-tolerant rough set
CN111104508B (en) * 2019-10-25 2022-07-01 重庆邮电大学 Method, system and medium for representing word bag model text based on fault-tolerant rough set
CN112818679A (en) * 2019-11-15 2021-05-18 阿里巴巴集团控股有限公司 Event type determination method and device and electronic equipment
CN110909162A (en) * 2019-11-15 2020-03-24 龙马智芯(珠海横琴)科技有限公司 Text quality inspection method, storage medium and electronic equipment
CN111125328A (en) * 2019-12-12 2020-05-08 深圳数联天下智能科技有限公司 Text processing method and related equipment
CN111125328B (en) * 2019-12-12 2023-11-07 深圳数联天下智能科技有限公司 Text processing method and related equipment
CN111368552A (en) * 2020-02-26 2020-07-03 北京市公安局 Network user group division method and device for specific field
CN112417893A (en) * 2020-12-16 2021-02-26 江苏徐工工程机械研究院有限公司 Software function demand classification method and system based on semantic hierarchical clustering
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN114492420A (en) * 2022-04-02 2022-05-13 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium
CN114492420B (en) * 2022-04-02 2022-07-29 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN109933670B (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN109933670A (en) A kind of file classification method calculating semantic distance based on combinatorial matrix
Kadhim Survey on supervised machine learning techniques for automatic text classification
Swathi et al. An optimal deep learning-based LSTM for stock price prediction using twitter sentiment analysis
Fong et al. Accelerated PSO swarm search feature selection for data stream mining big data
CN110209806B (en) Text classification method, text classification device and computer readable storage medium
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
US20230206000A1 (en) Data-driven structure extraction from text documents
CN108460089A (en) Diverse characteristics based on Attention neural networks merge Chinese Text Categorization
Lei et al. Patent analytics based on feature vector space model: A case of IoT
CN110245229A (en) A kind of deep learning theme sensibility classification method based on data enhancing
CN106611052A (en) Text label determination method and device
CN108009148A (en) Text emotion classification method for expressing based on deep learning
CN109241377A (en) A kind of text document representation method and device based on the enhancing of deep learning topic information
CN110516074A (en) Website theme classification method and device based on deep learning
CN107220311A (en) A kind of document representation method of utilization locally embedding topic modeling
EP4226283A1 (en) Systems and methods for counterfactual explanation in machine learning models
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
Sadr et al. Convolutional neural network equipped with attention mechanism and transfer learning for enhancing performance of sentiment analysis
CN111680225A (en) WeChat financial message analysis method and system based on machine learning
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN112183652A (en) Edge end bias detection method under federated machine learning environment
CN112784013A (en) Multi-granularity text recommendation method based on context semantics
Chakraborty et al. Bangla document categorisation using multilayer dense neural network with tf-idf
CN108595909A (en) TA targeting proteins prediction techniques based on integrated classifier
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant