CN109933670A - Text classification method for computing semantic distance based on a combined matrix - Google Patents

Text classification method for computing semantic distance based on a combined matrix

Info

Publication number
CN109933670A
CN109933670A
Authority
CN
China
Prior art keywords
text
vector
word
matrix
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910209354.XA
Other languages
Chinese (zh)
Other versions
CN109933670B (en
Inventor
裘嵘
杨俊杰
张祖平
罗律
Current Assignee
Central South University
Original Assignee
Central South University
Priority date
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN201910209354.XA priority Critical patent/CN109933670B/en
Publication of CN109933670A publication Critical patent/CN109933670A/en
Application granted granted Critical
Publication of CN109933670B publication Critical patent/CN109933670B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method that computes semantic distance based on a combined matrix, comprising the steps of: S1, processing Chinese text to generate a vector space model based on a bag of words; S2, for the full text set, using the generated bag-of-words text vectors as the training corpus, training word vectors with word2vec, and combining the trained word vectors with the text vectors to form a text matrix; S3, performing a cross operation on the text matrices to obtain the semantic distance between texts. The text-vector representation and semantic-distance calculation proposed by the invention both overcome the defects of the traditional bag-of-words model and remedy the shortcomings of the TF-IDF algorithm, so that a better classification model can be trained and the accuracy of text classification improved.

Description

Text classification method for computing semantic distance based on a combined matrix
Technical field
The invention belongs to the field of natural language processing, and more particularly relates to a text classification method for computing semantic distance based on a combined matrix.
Background art
With the rapid development of the Internet and the gradual spread of communication technology, how to efficiently organize and manage network data that grows exponentially has become a difficult and urgent research topic. In this vast sea of documents, a large share of the data takes text as its form of expression. The process of automatically classifying these texts is called text classification, a technique combining pattern recognition with natural language processing that aims to associate a target text with one or more categories according to the attributes or features of its content. Traditional text classification methods were based on knowledge engineering: experts in the relevant field manually extracted classification rules from their experience with the target texts, and these rules served as the basis for classification. In recent years, with advances in machine learning and deep learning technologies and in computing speed, text classification methods based on statistical machine learning have come to be favored and have achieved significant advantages in the accuracy and stability of classification results.
The current industrial process of text classification based on statistical machine learning usually has several important steps: first, digitally model the natural language text, i.e., express the real text in a data form that a computer can process efficiently; second, convert all target documents into this specific data representation according to the modeling method; third, define the operational relations between the data representations of different documents; fourth, using the data representation of text documents and the relations between different representations, design and train a machine learning model for text classification; fifth, for a given document of unknown category, convert it into the specific data representation, feed it to the trained machine learning model, and obtain the predicted category of the document.
In natural language processing, the digital modeling of text typically uses the vector space model (VSM): a text document is expressed as a vector in an n-dimensional vector space, where each distinct position in the vector represents a feature item, and the numerical value at each position indicates the weight, i.e., the importance, of that position within the whole vector. The sum of the feature items constitutes the overall representation of the text. In Chinese, the word is the smallest unit that expresses meaning in the language, so in Chinese natural language processing the words in a text are usually chosen as the feature items of the text vector, and the weight of each word's position in a particular text vector indicates the importance of that word in the document.
In practice, the bag-of-words model (BoW) is generally used as the initial form of the vector space model. The bag-of-words model places all the words of all texts in fixed positions of arbitrary order; because each distinct word serves as a unique feature item, the fixed word universe constitutes a vector space capable of expressing any text. For a specific text, the weight of each feature item in its text vector is the frequency with which the corresponding word occurs in that text, i.e., the number of occurrences of a word in a particular text measures its importance in that text. Besides representing texts, the vector space model can also represent words themselves as vectors in the n-dimensional space: for a specific word in the word universe, its word vector is expressed with one-hot coding, where the value at the word's fixed position in the vector space is 1 and the positions representing all other words are 0. A specific text's bag-of-words text vector can therefore also be viewed as the sum of the word vectors of all the words constituting that text.
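The bag-of-words and one-hot constructions described above can be sketched as follows. This is an illustrative example only, not part of the patent's disclosure; all function names and the toy corpus are chosen for illustration.

```python
# Minimal sketch of a bag-of-words text vector and one-hot word vectors
# over a fixed, shared vocabulary, as described above.
from collections import Counter

def build_vocab(docs):
    """Fixed, ordered word universe shared by all texts."""
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def bow_vector(doc, vocab):
    """Term-frequency vector: each position is a word, the value its count."""
    vec = [0] * len(vocab)
    for word, freq in Counter(doc).items():
        vec[vocab[word]] = freq
    return vec

def one_hot(word, vocab):
    """One-hot word vector: 1 at the word's fixed position, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

docs = [["machine", "learning", "text"], ["text", "classification", "text"]]
vocab = build_vocab(docs)
print(bow_vector(docs[1], vocab))  # counts at fixed vocabulary positions
```

Note that summing the one-hot vectors of a document's words reproduces its bag-of-words vector, matching the observation in the text above.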
In the bag-of-words model, the value at each position of the text vector is determined solely by the frequency of the corresponding word in the text, which is a significant limitation. Besides representing the text-vector weights by word frequency, the method most commonly used in engineering applications to compute text-vector weights is the term frequency-inverse document frequency (TF-IDF) algorithm. Its idea is that the weight of a feature item is proportional to the frequency with which the item occurs in the document and inversely proportional to the number of documents in the whole corpus that contain it: for a specific document, the more often a word occurs in it, the higher its importance; but the more often the same word occurs across other documents, the more generic it is, and the lower its weight.
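The TF-IDF idea described above can be sketched in a few lines. This is an illustrative example; TF-IDF variants differ in log base and smoothing, and the specific form below is an assumption, not the patent's exact formula.

```python
# Hedged sketch of TF-IDF weighting: weight grows with in-document
# frequency and shrinks with corpus-wide document frequency.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document freq.
    return tf * idf

corpus = [["cat", "sits"], ["cat", "cat", "runs"], ["dog", "runs"]]
# "dog" appears in only 1 of 3 documents, "runs" in 2, so for the third
# document "dog" receives a higher weight than the more generic "runs".
```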
The similarity between any two documents D_x and D_y can be measured by some distance relation between their vectors D_x and D_y in the n-dimensional space. A common method computes the inner product between the vectors, which can be used to characterize the angle, i.e., the cosine similarity, between the two vectors.
However, in practice there are tens of thousands of common Chinese words, while a single document generally contains only hundreds or thousands, so the generated vector space has tens of thousands of dimensions while a vector has only a few thousand nonzero entries. The resulting document vectors and word vectors are all high-dimensional sparse vectors. The similarity obtained from the inner product of two such sparse vectors is not only imprecise, but the model and its calculation also completely ignore the relations between the feature items of the text vector, i.e., the semantic distance between words.
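The limitation described above can be made concrete with a small sketch (illustrative only; the two-word documents are toy data): under the bag-of-words model, two documents that share no surface words score zero cosine similarity no matter how close their meanings are.

```python
# Inner-product / cosine similarity between two document vectors,
# illustrating why sparse high-dimensional vectors give coarse results:
# words that never co-occur contribute exactly zero, regardless of meaning.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# "car" and "automobile" occupy different vocabulary positions, so these
# two semantically close documents score 0 under the bag-of-words model.
d1 = [1, 0]  # contains "car"
d2 = [0, 1]  # contains "automobile"
print(cosine(d1, d2))  # → 0.0
```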
In recent years, the development of deep learning theory and technology has had a deep effect on natural language processing. Particularly important is a distributed-representation technique known as word embedding. Distributed representation is based on the distributional hypothesis and obtains the semantic representation of words from a co-occurrence matrix, while word embedding maps each word into a new space, representing it as a multidimensional continuous real-valued vector. Among word embedding methods, the most notable is the word2vec model proposed by Google, which trains a language model with an artificial neural network algorithm and, in the course of training, obtains a low-dimensional vector for each word. Representing word vectors in a low-dimensional space not only solves the dimensionality-disaster problem but also uncovers the relational properties between words, improving the accuracy of word vectors in semantic expression.
Summary of the invention
The method proposed by the present invention for representing texts and computing the semantic distance between texts differs from the traditional strategies above.
In the text-representation modeling method, suppose the word universe used to express all texts contains n words, so that the vector space of the text language model has dimension n, and let m be the prescribed dimension of the word vectors to be trained. For a specific text document, the TF-IDF algorithm is first used to compute the weight of each word in the document, yielding an n-dimensional weight vector for the text. The text is not expressed directly by this weight vector; instead, the value of each feature item in the weight vector, i.e., the word's weight, is multiplied by the word vector at that feature item's position, thereby embedding the m-dimensional word vector of each word into the word's weight. The resulting new m-dimensional weighted word vectors replace the original weight values at the corresponding positions of each word, finally producing an n × m matrix representation of the text. Each row of the text matrix is a row vector whose value equals the word vector scalar-multiplied by the word's weight. Such a row vector carries not only the weight of a word in the text but also the syntactic and semantic features of the word. By this means the vector space representation of text is extended from a traditional vector to a matrix, making its semantic expressiveness and the amount of information it carries much richer.
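The n × m text-matrix construction described above can be sketched as follows. The weights and embeddings are toy values, not trained ones, and the function name is chosen for illustration.

```python
# Sketch of the text matrix: row i is word i's TF-IDF weight times its
# m-dimensional word vector (here n = 2 words, m = 3 dimensions).
import numpy as np

def text_matrix(weights, embeddings):
    """weights: (n,) TF-IDF vector; embeddings: (n, m) word vectors."""
    return weights[:, None] * embeddings  # scale row i by weight i

weights = np.array([0.5, 2.0])
embeddings = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])
M = text_matrix(weights, embeddings)
print(M.shape)  # (2, 3)
```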
Unlike the traditional way of measuring text semantic distance by computing the cosine similarity of text vectors, the invention proposes a text semantic-distance calculation method based on the text-representation modeling above. For two different texts whose data representation is an n × m matrix, each row vector of document matrix 1 computes a similarity not only with the row vector at the corresponding position of document matrix 2 but with every row vector of document matrix 2, and the results are accumulated as the semantic similarity measure of the two documents. Because word vectors encode the semantic distance and grammatical relations between words, this method can more accurately compute the semantic distance and co-occurrence probability between the different feature items of different texts.
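The cross-matrix accumulation described above can be sketched as follows. This is an assumption faithful to the description rather than a verbatim reproduction of the patent's formula; the matrices are toy data.

```python
# Every row of one text matrix is compared with every row of the other,
# and the dot products are accumulated into a single similarity score.
import numpy as np

def semantic_distance(A, B):
    """A, B: (n, m) text matrices of two documents."""
    total = 0.0
    for row_a in A:      # each weighted word vector of document 1 ...
        for row_b in B:  # ... against each weighted word vector of document 2
            total += float(np.dot(row_a, row_b))
    return total

A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 1.0], [2.0, 0.0]])
print(semantic_distance(A, B))  # sum of all pairwise dot products
```

The double loop is equivalent to summing all elements of the matrix product A·Bᵀ, which is the form used in the distance formula later in the description.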
As for the choice of machine learning classification model: the field addressed by the invention is text classification in natural language processing, and texts can fall into many categories under different classification criteria, so binary classifiers are of limited use and multi-class classifiers are usually employed in this task. Among multi-class classifiers, distance-based classification algorithms fit the present application scenario better than common algorithms such as support vector machines or multinomial logistic regression, because the most important basis of the classification task is the semantic distance between texts, and the former have lower algorithmic complexity and computational cost, which speeds up model training and prediction.
The present invention uses the KNN classification algorithm, combined with the iterated class-centroid idea of the K-Means clustering algorithm. The idea of KNN is to find, for an unknown sample, the K known samples closest to it in feature space (i.e., nearest in language distance in the language model); if the majority of these K samples belong to some particular category, the unknown sample is predicted to belong to that category as well. KNN must compute the distance between the unknown sample and every sample in the feature space, so if the feature samples are the full training set, the computation for a single classification task is very large. The invention therefore builds the feature space by "choosing representatives": according to the proportion of each category's texts in the full text set, a weighted number of "representative" samples is chosen from each category as that category's part of the feature space. For the choice of "representative" features, the centroid (cluster-center) iteration of the K-Means clustering method is borrowed: for each category in the training samples, several samples are randomly drawn from the category and their centroid computed, repeated many times; the number of centroids computed from random samples of a category depends on the number of samples in that category; finally, the set of centroids computed for all categories serves as the sample space of the classification model.
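The "choosing representatives" step described above can be sketched as follows. The subset size, centroid counts, and sample data are illustrative assumptions, not the patent's exact parameters.

```python
# For each class, repeatedly draw a random subset of its samples and keep
# the subset's centroid; the number of centroids kept per class would be
# proportional to the class's share of the full text set.
import random
import numpy as np

def class_centroids(samples, n_centroids, subset_size=3, seed=0):
    rng = random.Random(seed)
    centroids = []
    for _ in range(n_centroids):
        subset = rng.sample(samples, min(subset_size, len(samples)))
        centroids.append(np.mean(subset, axis=0))  # centroid of this draw
    return centroids

class_a = [np.array([1.0, 1.0]), np.array([1.2, 0.8]), np.array([0.9, 1.1])]
cents = class_centroids(class_a, n_centroids=2)
print(len(cents))  # 2 representative centroids for this class
```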
For a text document of unknown category, to use the model it is first converted, by the method above, into a text matrix; the text matrix is then placed into the classification model to obtain the predicted category of the unknown text.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1. It overcomes the defect of the traditional bag-of-words model that, for a specific text, the words in the bag are unordered, so the contextual relations between words in the document cannot be considered;
2. It remedies the shortcoming of the TF-IDF algorithm of measuring the semantic distance between documents only from the computed term weights and the cosine similarity between weight vectors;
3. Although the computational complexity of this way of computing inter-text semantic distance is higher, it yields more accurate similarity measures between texts: it not only brings more similar texts closer together in the vector space but also pushes more dissimilar texts farther apart, i.e., it gives texts of different categories clearer classification boundaries. The improvement brought by this effect also carries into the training of the subsequent machine learning classification model: the training data set has clearer class discrimination, so a better classification model can be trained and the accuracy of text classification improved.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the technical flow chart of the text classification performed by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention fall within the scope protected by the invention.
As shown in Fig. 1, the technical flow by which the invention classifies texts specifically includes: performing word segmentation on the Chinese text and filtering stop words; counting word frequencies and generating the bag-of-words representation of the text, i.e., the original text vector; updating each feature weight in the bag-of-words text vector with the TF-IDF algorithm; for the full text set, using the bag-of-words text vectors generated in the above steps as the training corpus and training word vectors with word2vec; combining the trained word vectors with the text vectors to form text matrices; performing the cross operation on the text matrices to obtain the semantic distance between texts; training a distance-based text classifier; converting a text of unknown category into a text matrix, inputting the text matrix to the classification model, and obtaining the predicted category of the text. The specific implementation is as follows:
For the text classification task, assume a labeled text data set D_X = {D_x1, D_x2, …, D_xk} with corresponding class labels L_X = {L_x1, L_x2, …, L_xk}, where D_xi is a specific text document in the data set and L_xi is its category.
1. Vector space representation of the document
Since in the vector space model of Chinese text the smallest feature-item unit is the Chinese word, for a candidate document D_xi of continuous text it is first necessary to perform automatic word segmentation with a Chinese word segmentation algorithm, filter out words without substantive meaning such as auxiliary words and modal particles, and finally split the text into a string of consecutive words.
The consecutive words are then counted, and the bag-of-words model of the text [(t_1, f_1), (t_2, f_2), …, (t_n, f_n)] is generated from the statistics. Each entry of the bag of words is a 2-tuple whose first element t_i is the word itself and whose second element f_i is its frequency in the text. It is worth noting that throughout the processing of all text documents, the word position order (t_1, t_2, …, t_n) of the bag of words is fixed and unchanging.
In the bag of words of each text, the weight of word t_i is its occurrence frequency f_i in that text. To assess more accurately the importance of a particular word t_i in a specific document D_xi, the weight w_i of each word must be updated according to the TF-IDF calculation formula:

w_i = K(t_i, D_xi) = tf(t_i, D_xi) × idf(t_i)

K(t_i, D_xi) is the new computed weight, where tf(t_i, D_xi) is the occurrence frequency of word t_i in document D_xi, and idf(t_i) is the inverse of the frequency with which word t_i occurs in the whole document set.
From the above process, the vector space model representation of any text document D_xi can be obtained:

D_xi = [(t_1, w_1), (t_2, w_2), …, (t_n, w_n)]

Since the word position order of the bag of words is fixed, each word has a fixed index position in the model vector, so the text vector can be expressed as:

D_xi = [w_1, w_2, …, w_n]
At this point the vector expression of the text has been obtained, converting the natural language text into a text vector amenable to mathematical modeling and computation.
2. Training the word vectors
Word vectors can be trained by building a recurrent neural network with Google's TensorFlow deep learning framework, or with Google's open framework word2vec. Behind the word2vec algorithm is a shallow artificial neural network that can quickly and conveniently train the required word-vector model; it can train efficiently on dictionaries of millions of words and data sets of hundreds of millions of items, and at a later stage new corpora can be added to a previous model for incremental training, optimizing model quality.
The bag of words produced in the above steps can be used directly as the corpus for word-vector training. This both makes the semantic information carried by the trained word vectors fit the actual task more closely and guarantees that the words used in the document set can all be represented by the word-vector model.
For the trained word-vector model, the word vector of word t_i takes the form:

v(t_i) = [v_1, v_2, …, v_m]

where the vector dimension m of the word vectors is a value set at training time, usually chosen between 100 and 1000.
3. Matrix model representation of the document
Based on the document vector model and the word-vector model above, a document can be expressed with the word-vector-based TF-IDF weight combination. For any text document D_xi = [w_1, w_2, …, w_n], the weight w_i of each feature item t_i in the vector is scalar-multiplied by the word vector v(t_i) of that feature item to obtain a new vector w_i·v(t_i). Replacing the original scalar weight with this new vector extends the text vector into a text matrix of size n × m:

D_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]
Thus, for a specific text document, the text-matrix form not only expresses the importance of each word in the text but also retains the word's grammatical (i.e., contextual) and semantic information, raising the volume and quality of the information carried by the text to a new dimension and level.
4. Computing the semantic distance between text documents
The text matrix above can also be expressed in the following vector form:

D_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]

where w_j·v(t_j) denotes, for the word t_j at the j-th position of document D_xi, the vector obtained by scalar-multiplying the word's weight w_j by its word vector v(t_j).
For two candidate documents D_x and D_y, their semantic distance can be computed with the following formula:

dist(D_x, D_y) = sum(D_x · D_y^T)

where sum() accumulates all elements of the matrix. Expanding D_x over its n rows means computing, for each j-th position of D_x, the sum of the dot products of its vector with the vectors at all positions of D_y. The semantic distance formula for the two texts finally computed is therefore:

dist(D_x, D_y) = Σ_{j=1..n} Σ_{k=1..n} (w_j^x·v(t_j^x)) · (w_k^y·v(t_k^y))
From the calculation process it can be seen that this method considers not only the weight of each word in the text but also the co-occurrence probability, semantic relevance, and contextual relations between all feature items of the two documents, as reflected by the word-vector operations on words at different positions. It breaks the inherent limitations of traditional statistical calculation methods and achieves a comprehensive assessment of document semantic distance from statistics through to grammar, computing inter-document relevance more accurately and reliably.
5. Training the classifier
For the fully labeled training text set D_X = {D_x1, D_x2, …, D_xk}, the number of text data items of each category is counted from the label data L_X = {L_x1, L_x2, …, L_xk}. Using stratified sampling, a weighted number of "representative" samples is chosen from each category according to the proportion of the category's texts in the full text set, as that category's part of the feature space. For each category in the training samples, several samples are randomly drawn from the category and their centroid computed, repeated many times; the number of centroids computed from random samples of a category depends on the number of samples in that category. Finally, the set of centroids computed for all categories serves as the sample space of the classification model.
6. Predicting the category of an unknown text
For an unknown sample, steps 1, 2, and 3 above are first applied to convert the text document into a document matrix, generating the text matrix of the method above, which is placed into the classification model. According to the inter-text semantic-distance calculation of step 4, the semantic distances between the target document and all category centroids in the feature space are computed to find the K known samples closest to it (i.e., nearest in language distance in the language model). If the majority of these K samples belong to some particular category, the unknown sample is predicted to belong to that category as well.
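The prediction step above can be sketched as follows. For brevity, a simple one-dimensional stand-in distance is used here instead of the cross-matrix semantic distance of step 4; the centroid values, labels, and K are illustrative assumptions.

```python
# Score the unknown sample against every class centroid, then vote
# among the K nearest centroids.
from collections import Counter

def knn_predict(sample, centroids, labels, k=3):
    """centroids: list of points; labels: class of each centroid."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: abs(centroids[i] - sample))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

centroids = [0.1, 0.2, 0.9, 1.0, 1.1]
labels = ["sports", "sports", "finance", "finance", "finance"]
print(knn_predict(0.95, centroids, labels))  # → finance
```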
The above are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the invention shall be included within the protection scope of the invention.

Claims (9)

1. A text classification method for computing semantic distance based on a combined matrix, characterized by comprising the steps of:
S1, processing Chinese text to generate a vector space model based on a bag of words;
S2, for the full text set, using the generated bag-of-words text vectors as the training corpus, training word vectors with word2vec, and combining the trained word vectors with the text vectors to form a text matrix;
S3, performing a cross operation on the text matrices to obtain the semantic distance between texts.
2. The text classification method for computing semantic distance based on a combined matrix according to claim 1, characterized in that step S1 specifically comprises:
S1.1, performing automatic word segmentation on the Chinese text with a Chinese word segmentation algorithm while filtering out words without substantive meaning, splitting the text into a string of consecutive words;
S1.2, counting word frequencies and generating the bag-of-words representation of the text, i.e., the original text vector;
S1.3, updating each feature weight in the bag-of-words text vector with the TF-IDF algorithm to obtain the vector expression of the text.
3. The text classification method for computing semantic distance based on a combined matrix according to claim 2, characterized in that the bag-of-words model of the text [(t_1, f_1), (t_2, f_2), …, (t_n, f_n)] is generated from the statistics; each entry of the bag of words is a 2-tuple whose first element t_i is the word itself and whose second element f_i is its frequency in the text; and throughout the processing of all text documents, the word position order (t_1, t_2, …, t_n) of the bag of words is fixed;
to assess more accurately the importance of a particular word t_i in a specific document D_xi, the weight w_i of each word is updated according to the TF-IDF calculation formula:
w_i = K(t_i, D_xi) = tf(t_i, D_xi) × idf(t_i)
K(t_i, D_xi) is the new computed weight, where tf(t_i, D_xi) is the number of occurrences of word t_i in document D_xi, and idf(t_i) is the inverse of the frequency with which word t_i occurs in the whole document set;
therefore, the vector space model representation of any text document D_xi can be obtained:
D_xi = [(t_1, w_1), (t_2, w_2), …, (t_n, w_n)]
which simplifies to
D_xi = [w_1, w_2, …, w_n].
4. The text classification method for computing semantic distance based on a combined matrix according to claim 3, characterized in that step S2 specifically comprises: for any text document D_xi = [w_1, w_2, …, w_n], scalar-multiplying the weight w_i of each feature item t_i in the vector by the word vector v(t_i) of that feature item to obtain a new vector w_i·v(t_i); replacing the original scalar weight with this new vector extends the text vector into a text matrix of size n × m, which simplifies to
D_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]
where w_j·v(t_j), with t_j ∈ D_xi, denotes, for the word t_j at the j-th position of document D_xi, the vector obtained by scalar-multiplying the word's weight w_j by its word vector v(t_j).
5. The text classification method for computing semantic distance based on a combined matrix according to claim 4, characterized in that in step S3, for two candidate documents D_x and D_y, their semantic distance is computed with the following formula:
dist(D_x, D_y) = sum(D_x · D_y^T)
where sum() accumulates all elements of the matrix; expanding D_x over its n rows means computing, for each j-th position of D_x, the sum of the dot products of its vector with the vectors at all positions of D_y; the semantic distance formula for the two texts finally computed is therefore:
dist(D_x, D_y) = Σ_{j=1..n} Σ_{k=1..n} (w_j^x·v(t_j^x)) · (w_k^y·v(t_k^y)).
6. A text classification method, characterized by comprising the steps of:
S1, processing Chinese text to generate a vector space model based on a bag of words;
S2, for the full text set, using the generated bag-of-words text vectors as the training corpus, training word vectors with word2vec, and combining the trained word vectors with the text vectors to form a text matrix;
S3, performing a cross operation on the text matrices to obtain the semantic distance between texts;
S4, training a distance-based text classifier;
S5, converting a text of unknown category into a text matrix, inputting the text matrix to the classification model, and obtaining the predicted category of the text.
7. The text classification method according to claim 6, characterized in that step S1 specifically comprises:
S1.1, performing automatic word segmentation on the Chinese text with a Chinese word segmentation algorithm while filtering out meaningless words, dividing the text into a sequence of consecutive words;
S1.2, counting word frequencies to generate the bag-of-words representation of the text, i.e. the original text vector;
S1.3, updating the weight of each feature item in the bag-of-words text vector with the TF-IDF algorithm to obtain the text vector representation.
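Steps S1.2 and S1.3 can be sketched in a few lines of Python. This is a hedged toy version with our own function names and a plain log-based IDF, not necessarily the exact weighting variant the patent uses:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (the output of word segmentation, S1.1).
    Returns the vocabulary and one TF-IDF-weighted vector per document."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # document frequency of each term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        # weight = (term frequency) * log(N / document frequency)
        vectors.append([(tf[t] / len(d)) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors
```

A term present in every document gets IDF log(1) = 0, so ubiquitous words contribute nothing — the behaviour the claim relies on when it replaces raw counts with TF-IDF weights.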
8. The text classification method according to claim 6, characterized in that step S2 specifically comprises:
for any text document D_xi = [w_1, w_2, …, w_n], scalar-multiplying the weight w_i of each feature item t_i in the vector by the word vector v(t_i) of that feature item to obtain a new vector w_i·v(t_i); replacing the original scalar weight with this new vector extends the text vector into a text matrix of size n × m:
M_xi = [w_1·v(t_1), w_2·v(t_2), …, w_n·v(t_n)]^T
where t_j ∈ D_xi, and w_j·v(t_j) denotes the vector obtained by scalar-multiplying the weight w_j of the word t_j at the j-th position of document D_xi with that word's word vector v(t_j).
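Assuming toy embeddings in place of trained word2vec vectors, the matrix construction of this claim can be sketched as follows (function and variable names are ours):

```python
def text_matrix(weights, embeddings):
    """weights: TF-IDF weights [w1, ..., wn] of the document's n terms.
    embeddings: the m-dimensional word vector v(ti) of each term, in order.
    Returns the n x m matrix whose j-th row is wj * v(tj)."""
    return [[w * component for component in vec]
            for w, vec in zip(weights, embeddings)]
```

Each scalar weight of the original text vector is thus replaced by a full m-dimensional row, turning the 1 × n vector into an n × m matrix.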
9. The text classification method according to claim 6, characterized in that in step S3, for two candidate documents D_x and D_y, the semantic distance is calculated as:
Dist(D_x, D_y) = sum(M_x · M_y^T)
where sum() accumulates all elements of the matrix and M_x, M_y are the text matrices of D_x and D_y. Extending D_x to n dimensions computes, for any j-th position of D_x, the sum of the dot products between its vector and all position vectors of D_y, Σ_k (w_j·v(t_j)) · (w'_k·v(t'_k)). The semantic distance between the two texts is therefore:
Dist(D_x, D_y) = Σ_j Σ_k (w_j·v(t_j)) · (w'_k·v(t'_k))
CN201910209354.XA 2019-03-19 2019-03-19 Text classification method for calculating semantic distance based on combined matrix Active CN109933670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910209354.XA CN109933670B (en) 2019-03-19 2019-03-19 Text classification method for calculating semantic distance based on combined matrix

Publications (2)

Publication Number Publication Date
CN109933670A true CN109933670A (en) 2019-06-25
CN109933670B CN109933670B (en) 2021-06-04

Family

ID=66987629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910209354.XA Active CN109933670B (en) 2019-03-19 2019-03-19 Text classification method for calculating semantic distance based on combined matrix

Country Status (1)

Country Link
CN (1) CN109933670B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558489A (en) * 2018-12-03 2019-04-02 南京中孚信息技术有限公司 File classification method and device
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110389932A (en) * 2019-07-02 2019-10-29 华北电力科学研究院有限责任公司 Electric power automatic document classifying method and device
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110738049A (en) * 2019-10-12 2020-01-31 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110909162A (en) * 2019-11-15 2020-03-24 龙马智芯(珠海横琴)科技有限公司 Text quality inspection method, storage medium and electronic equipment
CN111104508A (en) * 2019-10-25 2020-05-05 重庆邮电大学 Method, system and medium for representing word bag model text based on fault-tolerant rough set
CN111125328A (en) * 2019-12-12 2020-05-08 深圳数联天下智能科技有限公司 Text processing method and related equipment
CN111368552A (en) * 2020-02-26 2020-07-03 北京市公安局 Network user group division method and device for specific field
CN112417893A (en) * 2020-12-16 2021-02-26 江苏徐工工程机械研究院有限公司 Software function demand classification method and system based on semantic hierarchical clustering
CN112818679A (en) * 2019-11-15 2021-05-18 阿里巴巴集团控股有限公司 Event type determination method and device and electronic equipment
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN114492420A (en) * 2022-04-02 2022-05-13 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100094875A1 (en) * 2008-08-11 2010-04-15 Collective Media, Inc. Method and system for classifying text
CN105426923A (en) * 2015-12-14 2016-03-23 北京科技大学 Semi-supervised classification method and system
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature
CN107590177A (en) * 2017-07-31 2018-01-16 南京邮电大学 A kind of Chinese Text Categorization of combination supervised learning
US20180114142A1 (en) * 2016-10-26 2018-04-26 Swiss Reinsurance Company Ltd. Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof
CN108255813A (en) * 2018-01-23 2018-07-06 重庆邮电大学 A kind of text matching technique based on term frequency-inverse document and CRF
CN108897769A (en) * 2018-05-29 2018-11-27 武汉大学 Network implementations text classification data set extension method is fought based on production
US20190034766A1 (en) * 2016-04-21 2019-01-31 Sas Institute Inc. Machine learning predictive labeling system
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN SONG et al.: "Research of Chinese Text Classification Methods Based on Semantic Vector and Semantic Similarity", IEEE *
ZHANG Jingyi et al.: "Research on a Text Classification Model Based on Word Vector Features", Information Technology and Standardization *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558489A (en) * 2018-12-03 2019-04-02 南京中孚信息技术有限公司 File classification method and device
CN110348497B (en) * 2019-06-28 2021-09-10 西安理工大学 Text representation method constructed based on WT-GloVe word vector
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110389932A (en) * 2019-07-02 2019-10-29 华北电力科学研究院有限责任公司 Electric power automatic document classifying method and device
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110738049A (en) * 2019-10-12 2020-01-31 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110738049B (en) * 2019-10-12 2023-04-18 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN111104508A (en) * 2019-10-25 2020-05-05 重庆邮电大学 Method, system and medium for representing word bag model text based on fault-tolerant rough set
CN111104508B (en) * 2019-10-25 2022-07-01 重庆邮电大学 Method, system and medium for representing word bag model text based on fault-tolerant rough set
CN112818679A (en) * 2019-11-15 2021-05-18 阿里巴巴集团控股有限公司 Event type determination method and device and electronic equipment
CN110909162A (en) * 2019-11-15 2020-03-24 龙马智芯(珠海横琴)科技有限公司 Text quality inspection method, storage medium and electronic equipment
CN111125328A (en) * 2019-12-12 2020-05-08 深圳数联天下智能科技有限公司 Text processing method and related equipment
CN111125328B (en) * 2019-12-12 2023-11-07 深圳数联天下智能科技有限公司 Text processing method and related equipment
CN111368552A (en) * 2020-02-26 2020-07-03 北京市公安局 Network user group division method and device for specific field
CN112417893A (en) * 2020-12-16 2021-02-26 江苏徐工工程机械研究院有限公司 Software function demand classification method and system based on semantic hierarchical clustering
CN113011166A (en) * 2021-04-19 2021-06-22 华北电力大学 Relay protection defect text synonym recognition method based on decision tree classification
CN114492420A (en) * 2022-04-02 2022-05-13 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium
CN114492420B (en) * 2022-04-02 2022-07-29 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN109933670B (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN109933670A (en) A kind of file classification method calculating semantic distance based on combinatorial matrix
Kadhim Survey on supervised machine learning techniques for automatic text classification
Swathi et al. An optimal deep learning-based LSTM for stock price prediction using twitter sentiment analysis
Fong et al. Accelerated PSO swarm search feature selection for data stream mining big data
CN110209806B (en) Text classification method, text classification device and computer readable storage medium
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
US20230206000A1 (en) Data-driven structure extraction from text documents
CN108460089A (en) Diverse characteristics based on Attention neural networks merge Chinese Text Categorization
Lei et al. Patent analytics based on feature vector space model: A case of IoT
CN110245229A (en) A kind of deep learning theme sensibility classification method based on data enhancing
CN106611052A (en) Text label determination method and device
CN108009148A (en) Text emotion classification method for expressing based on deep learning
CN109241377A (en) A kind of text document representation method and device based on the enhancing of deep learning topic information
CN110516074A (en) Website theme classification method and device based on deep learning
CN107220311A (en) A kind of document representation method of utilization locally embedding topic modeling
EP4226283A1 (en) Systems and methods for counterfactual explanation in machine learning models
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
Sadr et al. Convolutional neural network equipped with attention mechanism and transfer learning for enhancing performance of sentiment analysis
CN111680225A (en) WeChat financial message analysis method and system based on machine learning
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN112183652A (en) Edge end bias detection method under federated machine learning environment
CN112784013A (en) Multi-granularity text recommendation method based on context semantics
Chakraborty et al. Bangla document categorisation using multilayer dense neural network with tf-idf
CN108595909A (en) TA targeting proteins prediction techniques based on integrated classifier
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant