CN105005589B - Method and apparatus for text classification

Method and apparatus for text classification

Info

Publication number
CN105005589B
Authority
CN
China
Prior art keywords
word
category
classification
text
term vector
Prior art date
Legal status
Active
Application number
CN201510364152.4A
Other languages
Chinese (zh)
Other versions
CN105005589A (en)
Inventor
邹缘孙
Current Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201510364152.4A
Publication of CN105005589A
Application granted
Publication of CN105005589B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and apparatus for text classification, belonging to the field of Internet technology. The method includes: obtaining the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified; calculating, from the term vector of each word and the term vector of a first category, the first membership degree between each word and the first category, the first category being any category in a category set; calculating, from the first membership degree between each word and the first category and the word frequency, weight and inverse document frequency of each word, the second membership degree between the text and the first category; and selecting from the category set a category whose second membership degree with the text satisfies a preset condition, and determining the selected category as the category of the text. The apparatus includes a first acquisition module, a first computing module, a second computing module and a classification module. The invention improves the accuracy of text classification.

Description

Method and apparatus for text classification
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and apparatus for text classification.
Background art
With the development of Internet technology, texts on the Internet have become more and more numerous. While this abundance of text makes it convenient for users to find information, it also brings users great inconvenience. Text classification addresses this problem: it determines a category for a text according to pre-defined subject categories, so that texts can be organized by category and users can search them conveniently.
The prior art provides a method of text classification, which may proceed as follows: a server obtains a large number of manually labeled text samples, obtains the features of these text samples, and trains a classifier on those features. After the classifier is trained, the server can use it to classify texts that need to be classified. The detailed process is: the server obtains the features of a text to be classified and, according to those features, classifies the text with the trained classifier.
In the course of making the present invention, the inventor found that the prior art has at least the following problem:
The feature of a text to be classified is often a single key word in that text, and classifying a text according to a single key word is clearly inaccurate. For example, for a text describing the capital consumption of game development, the feature of the text obtained by the server may be "game", and according to that feature the category of the text would be determined as "game"; but since the emphasis of the text is mainly the capital consumption problem, it would be more appropriate to determine its category as "finance". Classifying a text by such a feature therefore has low accuracy.
Summary of the invention
To solve the problems of the prior art, the present invention provides a method and apparatus for text classification. The technical scheme is as follows:
A method of text classification, the method comprising:
obtaining the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified;
calculating, from the term vector of each word and the term vector of a first category, the first membership degree between each word and the first category, the first category being any category in a category set;
calculating, from the first membership degree between each word and the first category and the word frequency, weight and inverse document frequency of each word, the second membership degree between the text and the first category;
selecting from the category set a category whose second membership degree with the text satisfies a preset condition, and determining the selected category as the category of the text.
An apparatus for text classification, the apparatus comprising:
a first acquisition module, configured to obtain the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified;
a first computing module, configured to calculate, from the term vector of each word and the term vector of a first category, the first membership degree between each word and the first category, the first category being any category in a category set;
a second computing module, configured to calculate, from the first membership degree between each word and the first category and the word frequency, weight and inverse document frequency of each word, the second membership degree between the text and the first category;
a classification module, configured to select from the category set a category whose second membership degree with the text satisfies a preset condition, and to determine the selected category as the category of the text.
In the embodiments of the present invention, the second membership degree between the text and a first category is calculated from the term vector, word frequency, weight and inverse document frequency of each word contained in the text to be classified and from the term vector of the first category, the first category being any category in a category set; a category is then selected from the category set according to the second membership degrees between the text and the categories. Because the present invention takes every word contained in the text into account when classifying the text to be classified, the accuracy of classification is improved.
Brief description of the drawings
Fig. 1 is a flowchart of a text classification method provided by Embodiment 1 of the present invention;
Fig. 2-1 is a flowchart of a text classification method provided by Embodiment 2 of the present invention;
Fig. 2-2 is a schematic diagram of generating the word set of each category provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of a text classification apparatus provided by Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a server provided by Embodiment 4 of the present invention.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment 1
An embodiment of the present invention provides a method of text classification. Referring to Fig. 1, the method includes:
Step 101: obtain the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified;
Step 102: calculate, from the term vector of each word and the term vector of a first category, the first membership degree between each word and the first category, the first category being any category in a category set;
Step 103: calculate, from the first membership degree between each word and the first category and the word frequency, weight and inverse document frequency of each word, the second membership degree between the text and the first category;
Step 104: select from the category set a category whose second membership degree with the text satisfies a preset condition, and determine the selected category as the category of the text.
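The four steps above can be sketched end to end. Everything below (the vectors, frequencies, weights and category names) is invented for illustration, and the first membership degree is taken as the cosine similarity between word vector and category vector, which is one natural reading of the distance measure used later in step 206:

```python
import math

def cosine(u, v):
    # similarity between two vectors: dot product over the product of magnitudes
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(words, category_vectors):
    # words: {word: (term vector, word frequency tf, weight p, idf)}
    scores = {}
    for category, cvec in category_vectors.items():
        # Step 102: first membership degree b = similarity(word vector, category vector)
        # Step 103: second membership degree = sum over words of tf * p * idf * b
        scores[category] = sum(tf * p * idf * cosine(vec, cvec)
                               for vec, tf, p, idf in words.values())
    # Step 104: preset condition here is "maximum second membership degree"
    return max(scores, key=scores.get), scores

words = {
    "game":  ([1.0, 0.1], 2, 0.5, 1.2),
    "funds": ([0.1, 1.0], 5, 1.0, 1.5),
}
category_vectors = {"game": [1.0, 0.0], "finance": [0.0, 1.0]}
label, scores = classify(words, category_vectors)
```

With these invented numbers the "funds" word dominates, so the text lands in "finance" even though it also mentions "game", mirroring the background example.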
In the embodiments of the present invention, the second membership degree between the text and a first category is calculated from the term vector, word frequency, weight and inverse document frequency of each word contained in the text to be classified and from the term vector of the first category, the first category being any category in a category set; a category is then selected from the category set according to the second membership degrees between the text and the categories. Because the present invention takes every word contained in the text into account when classifying the text to be classified, the accuracy of classification is improved.
Embodiment 2
An embodiment of the present invention provides a method of text classification. When a server needs to classify a text, it can use the text classification method provided by this embodiment of the present invention so as to improve the accuracy of classification. The method is executed by a server. Referring to Fig. 2-1, the method includes:
Step 201: obtain a plurality of text samples;
The text samples are used to train the word set corresponding to each category in the category set, and each of the text samples corresponds to one category. In the embodiments of the present invention the text samples may belong to any category; to improve classification accuracy, the text samples may include samples for every category in the category set. For example, if the category set includes finance, entertainment, sports, fashion, automobile, real estate, technology and education, the selected text samples may include text samples of each of those categories.
In the embodiments of the present invention, a user may select a plurality of text samples and then input them to the server; the server receives the text samples input by the user.
Step 202: segment each of the plurality of text samples into words, and form the obtained words into a training set;
Using an existing word segmentation tool, each of the text samples is segmented to obtain the words each sample contains, and the words of all the samples form the training set.
The process of segmenting a text sample with a segmentation tool is prior art and is not described in detail here.
After the training set is obtained, step 203 is performed, and the words in the training set are clustered using an existing clustering method.
Step 203: cluster the words in the training set to obtain a plurality of word sets and the category of each of the word sets;
This step can be implemented by the following steps (1) to (3):
(1): obtain the term vector of each word in the training set;
The term vector of a word is a vector representation describing the characteristics of the word; in the embodiments of the present invention it refers in particular to a word vector constructed with word embedding technology.
Any method of obtaining term vectors may be used in the embodiments of the present invention to obtain the term vector of each word in the training set, for example the word2vec word embedding method based on a neural network language model. The detailed process of obtaining a word's term vector with word2vec is prior art and is not described in detail here.
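word2vec itself is prior art and out of scope here. As a dependency-free stand-in, the sketch below derives crude co-occurrence vectors from a toy corpus (the corpus and window size are invented); a production system would instead train real embeddings, for example with a word2vec implementation:

```python
# Stand-in for word2vec: represent each word by how often it co-occurs with
# every vocabulary word inside a small context window. This is NOT the patent's
# method, only an illustration of what "an n-dimensional term vector" means.

def cooccurrence_vectors(sentences, window=2):
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            # count neighbors within the window on either side
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vectors[w][index[s[j]]] += 1.0
    return vectors

sentences = [["game", "fun"], ["game", "play"], ["fund", "cost"]]
vecs = cooccurrence_vectors(sentences)
```

The resulting vectors have one dimension per vocabulary word; real word2vec vectors are dense and learned, but play the same role in the distance computations that follow.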
The term vector of each word in the training set is an n-dimensional vector and can be expressed as Wi = (w1, w2, ..., wn), where Wi is the term vector of the i-th word and wn is the value of its n-th dimension.
Because modal particles and similar function words play no key role when a text is classified, such words can be removed in this step in order to reduce computation and improve classification accuracy, so that only the term vectors of the remaining words in the training set are obtained. This step can then be:
obtain the words of a preset type from the training set, remove the obtained words from the training set to get the remaining words, and obtain the term vectors of the remaining words.
The preset type may be modal particles, auxiliary words and the like. The process of obtaining the term vectors of the remaining words is the same as that of obtaining the term vector of each word in the training set, and is not repeated here.
Further, after the term vector of each word in the training set is obtained, each word and its term vector are stored in a word-to-term-vector correspondence, so that when a text to be classified is later processed, the term vectors of its words can be obtained directly from the correspondence; this saves the time of obtaining term vectors and improves classification efficiency.
(2): calculate, from the term vectors of the words, the distance between any two of the words;
For any two of the words, the distance between them is calculated from their term vectors according to formula (1) below:
dist(Wi, Wj) = (Wi · Wj) / (|Wi| * |Wj|)   (1)
where Wi is the term vector of the i-th word, |Wi| is the magnitude of that vector, Wj is the term vector of the j-th word, |Wj| is the magnitude of that vector, and dist(Wi, Wj) is the distance between the i-th word and the j-th word.
If only the term vectors of the remaining words were obtained in step (1), this step can be:
calculate, from the term vectors of the remaining words in the training set, the distance between any two of the remaining words.
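Formula (1) normalizes by |Wi| and |Wj|, which suggests the cosine form. A sketch of the pairwise computation over a toy vocabulary (the vectors are invented):

```python
import math

# Pairwise similarity per the normalized form of formula (1):
# dist(Wi, Wj) = Wi·Wj / (|Wi| * |Wj|). Vectors below are invented.

def pairwise(vectors):
    sims = {}
    for wi, vi in vectors.items():
        for wj, vj in vectors.items():
            if wi < wj:  # each unordered pair once
                dot = sum(a * b for a, b in zip(vi, vj))
                norm = (math.sqrt(sum(a * a for a in vi))
                        * math.sqrt(sum(b * b for b in vj)))
                sims[(wi, wj)] = dot / norm
    return sims

sims = pairwise({"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]})
```

Note the patent uses this quantity both as a "distance" thresholded from below in step (3) and as a membership degree maximized in step 208; whether small or large values count as "similar" therefore differs between the two uses in the source text.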
(3): form a word set from the words whose pairwise distance is less than a preset distance, and obtain the category of the word set as labeled by the user.
The distance between two words represents the similarity between them; if it is less than the preset distance, the two words are determined to be similar, are placed in one word set, and are determined to belong to the same category. By this method every word in the training set can be grouped, forming a plurality of word sets. From the words that each word set contains, the user determines and labels the category of each word set and inputs the categories to the server; the server receives the category of each word set input by the user.
The preset distance can be set and changed as needed and is not specifically limited in the embodiments of the present invention; for example, it may be 0.2 or 0.5.
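Step (3) can be sketched as a single-link grouping: any two words closer than the preset distance end up in the same word set. The words and pairwise distances below are invented:

```python
# Minimal union-find grouping: merge words whose distance is below the threshold,
# then collect the resulting word sets. Distances here are invented placeholders.

def group_words(words, dist, threshold):
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w

    for i, wi in enumerate(words):
        for wj in words[i + 1:]:
            if dist[frozenset((wi, wj))] < threshold:
                parent[find(wj)] = find(wi)  # merge the two groups

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())

words = ["game", "play", "fund"]
dist = {frozenset(("game", "play")): 0.1,
        frozenset(("game", "fund")): 0.9,
        frozenset(("play", "fund")): 0.8}
clusters = group_words(words, dist, threshold=0.2)
```

Each returned set corresponds to one word set to be labeled with a category by the user.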
It should be noted that any clustering method may be used in the embodiments of the present invention to cluster the words in the training set into a plurality of word sets. For example, with hierarchical clustering, the word sets and the relations among them can be obtained as shown in Fig. 2-2, where each circle represents a word and the different levels represent the hierarchy of the clustering; in the clustering result, the word set corresponding to each level is labeled by manually browsing the words that the level contains.
In the embodiments of the present invention, the words contained in the text samples are clustered, and labeling the text samples is replaced by labeling the word sets to obtain the word set of each category. The present invention therefore requires only a small amount of labeling, which saves human resources, shortens labeling time and improves classification efficiency. Moreover, when the word set of each category is obtained, only a small number of text samples are needed and they do not need to be labeled, which saves time and human resources and yields faster classification. This is especially valuable in the Internet industry, where texts are of many categories and enormous in number; to classify texts quickly, the method provided by the embodiments of the present invention can be used, shortening classification time and improving classification efficiency.
In the embodiments of the present invention, configuring the correspondence between categories and word sets enables migration of the classification model. The texts of different business scenarios may be longer news articles, shorter video titles, user microblogs and the like, and different businesses may be concerned with different categories. With this text-based approach, it is only necessary to add a category to the category set and establish the word set of the added category in order to migrate the classification model, which solves the problem of adapting the model to new scenarios, so that the classification model can quickly respond to the classification demands of different business scenarios.
Further, the word set corresponding to each category may also be obtained without clustering, by direct user labeling. Steps 201-203 could then be replaced by: the user assembles a plurality of words into a training set, classifies the words in the training set to obtain a plurality of word sets, labels each of the word sets to obtain its category, and then inputs each word set and its category to the server; the server receives each word set and the category of each word set input by the user.
Further, when the category of each word set is obtained, the term vector of each category is calculated from the word set of that category, and each category and its term vector are stored in a category-to-term-vector correspondence, so that when the term vector of a category is needed later it does not have to be recomputed and can be obtained directly from the correspondence for that category.
For each category, the process of calculating the term vector of the category can be:
obtain the term vector of each word in the word set of the category, calculate the average of the obtained term vectors, and take the average term vector as the term vector of the category.
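A minimal sketch of the category term vector as the average of its word-set vectors (the vectors are invented):

```python
# The category's term vector is the dimension-wise mean of the term vectors of
# the words in its word set.

def category_vector(word_vectors):
    n = len(word_vectors)
    dims = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / n for d in range(dims)]

cvec = category_vector([[1.0, 0.0], [0.0, 1.0]])
```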
It should be noted that steps 201-203 are the process of training the word set of each category and therefore need to be performed only once; when texts are later classified according to the word sets of the categories, steps 201-203 do not need to be performed again, and only steps 204 to 208 are performed on the text to be classified.
Step 204: obtain the term vector of a first category from the word set of the first category, the first category being any category in the category set;
Specifically, obtain the term vector of each word in the word set of the first category, calculate the average of the obtained term vectors, and take the average term vector as the term vector of the first category. Alternatively, obtain the term vector of the first category from the category-to-term-vector correspondence according to the first category.
The term vector of each word in the word set of the first category may be obtained with the word2vec word embedding method based on a neural network language model, or from the word-to-term-vector correspondence according to each word in the word set of the first category; the term vectors of the words in the word set of every category in the category set are obtained in the same way.
Step 205: obtain the term vector, word frequency, weight and inverse document frequency of each word contained in the text to be classified;
Using an existing word segmentation tool, segment the text to be classified to obtain the words it contains. Obtain the term vector of each word with the word2vec word embedding method based on a neural network language model, or from the word-to-term-vector correspondence according to each word. For each word the text contains, count the number of times the word occurs in the text as the word frequency of the word; obtain the position of the word in the text and, according to that position, obtain the weight of the word; and obtain the inverse document frequency of the word in the training set.
Inverse document frequency, also known as anti-document frequency, is the reciprocal of document frequency. The process by which the server obtains a word's inverse document frequency in the training set can be:
obtain the number of times the word occurs in the training set and the total number of words the training set includes; calculate the ratio of the occurrence count to the total number to obtain a first value; and take the reciprocal of the first value to obtain the inverse document frequency of the word in the training set.
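Read literally, the inverse document frequency here is the reciprocal of the word's frequency ratio in the training set. A sketch under that reading (many systems take the logarithm of this reciprocal instead, which the patent text does not state):

```python
# idf per the description above: count / total gives the "first value",
# and its reciprocal is the inverse document frequency.

def inverse_doc_frequency(word, training_words):
    count = training_words.count(word)
    ratio = count / len(training_words)  # the "first value"
    return 1.0 / ratio

training_words = ["game", "play", "game", "fund"]
idf_fund = inverse_doc_frequency("fund", training_words)
```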
The server sets a correspondence between word positions in a text and weights; the step of obtaining the weight of the word according to its position in the text can then be:
according to the position of the word in the text, obtain the weight of the word from the position-to-weight correspondence.
When setting the correspondence between word positions and weights, the server may set higher weights for words in the title, abstract or other important positions of a text, and lower weights for words in the body of the text.
Step 206: calculate, from the term vector of each word and the term vector of the first category, the first membership degree between each word and the first category;
In the embodiments of the present invention, the distance between the term vector of a word and the term vector of the first category measures the first membership degree between that word and the first category. This step can then be:
calculate the distance between the term vector of each word and the term vector of the first category, and take each such distance as the first membership degree between the corresponding word and the first category.
It should be noted that the first membership degree between each word and every category in the category set is calculated by the above process. The process of calculating the distance between the term vector of a word and the term vector of the first category is the same as that of calculating the distance between two vectors in step 203, and is not repeated here.
Step 207: calculate, from the first membership degree between each word and the first category and the word frequency, weight and inverse document frequency of each word, the second membership degree between the text and the first category;
This step can be implemented by the following steps (1) and (2):
(1): for each word, calculate the product of the word frequency, weight and inverse document frequency of the word and the first membership degree between the word and the first category, obtaining the third membership degree between the word and the first category;
For each word, the third membership degree between the word and the first category is calculated according to formula (2) below:
f_wi = p_wi * tf_wi * idf_wi * b_wi,c   (2)
where f_wi is the third membership degree between the i-th word and the first category, p_wi is the weight of the i-th word, tf_wi is the word frequency of the i-th word, idf_wi is the inverse document frequency of the i-th word, and b_wi,c is the first membership degree between the i-th word and the first category.
(2): accumulate the third membership degrees between the words and the first category to obtain the second membership degree between the text and the first category.
According to the third membership degree between each word and the first category, the second membership degree between the text and the first category is calculated according to formula (3) below:
S_c = f_w1 + f_w2 + ... + f_wm   (3)
where S_c is the second membership degree between the text and the first category and m is the number of words the text contains.
It should be noted that the second membership degree between the text and each category in the category set is calculated by the above process.
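Formulas (2) and (3) taken together: each word contributes the product of its weight, word frequency, inverse document frequency and first membership degree, and the contributions are summed. The numbers below are invented:

```python
# Second membership degree between text and category:
# sum over words of p * tf * idf * b (formulas (2) and (3)).

def second_membership(word_stats):
    # word_stats: list of (weight p, word frequency tf, idf, first membership b)
    return sum(p * tf * idf * b for p, tf, idf, b in word_stats)

score = second_membership([(1.0, 2, 1.5, 0.9), (0.5, 1, 2.0, 0.4)])
```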
Step 208: select from the category set a category whose second membership degree with the text satisfies a preset condition, and determine the selected category as the category of the text.
The second membership degree represents the similarity between the text and a category. The preset condition may be the maximum second membership degree, or a second membership degree greater than a first preset value. When the preset condition is the maximum second membership degree, this step can be: select from the category set the category with the largest second membership degree with the text, and determine the selected category as the category of the text.
When the preset condition is a second membership degree greater than the first preset value, this step can be: obtain from the category set the categories whose second membership degree with the text is greater than the first preset value, randomly select one category from the obtained categories, and determine the selected category as the category of the text.
Further, after the second membership degree between the text and each category in the category set is obtained, the second membership degrees may also be normalized according to formula (4) to obtain the normalized second membership degrees.
This step can then be: select from the category set a category whose normalized second membership degree with the text satisfies the preset condition, and determine the selected category as the category of the text.
In this case, the preset condition may be the maximum normalized second membership degree, or a normalized second membership degree greater than a second preset value.
When the preset condition is the maximum normalized second membership degree, this step can be: select from the category set the category with the largest normalized second membership degree with the text, and determine the selected category as the category of the text.
When the preset condition is a normalized second membership degree greater than the second preset value, this step can be: obtain from the category set the categories whose normalized second membership degree with the text is greater than the second preset value, randomly select one category from the obtained categories, and determine the selected category as the category of the text.
The first preset value and the second preset value can be set and changed as needed, and neither is specifically limited in the embodiments of the present invention.
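Step 208's two variants of the preset condition, sketched with invented scores:

```python
import random

# Variant 1: maximum second membership degree.
def pick_max(scores):
    return max(scores, key=scores.get)

# Variant 2: any category above the preset value, chosen at random;
# returns None when no category qualifies.
def pick_above_threshold(scores, threshold):
    candidates = [c for c, s in scores.items() if s > threshold]
    return random.choice(candidates) if candidates else None

scores = {"finance": 3.1, "game": 1.2, "sports": 0.4}
best = pick_max(scores)
```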
In the embodiments of the present invention, the second membership degree between the text and a first category is calculated from the term vector, word frequency, weight and inverse document frequency of each word contained in the text to be classified and from the term vector of the first category, the first category being any category in a category set; a category is then selected from the category set according to the second membership degrees between the text and the categories. Because the present invention takes every word contained in the text into account when classifying the text to be classified, the accuracy of classification is improved.
Embodiment 3
An embodiment of the present invention provides an apparatus for text classification. Referring to Fig. 3, the apparatus includes:
a first acquisition module 301, configured to obtain the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified;
a first computing module 302, configured to calculate, according to the term vector of each word and the term vector of a first category, the first degree of membership between each word and the first category, the first category being any category in a category set;
a second computing module 303, configured to calculate the second degree of membership between the text and the first category according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word;
a classification module 304, configured to select from the category set a category whose second degree of membership with the text meets a preset condition, and determine the selected category as the category of the text.
Further, the first computing module 302 includes:
a first acquisition unit, configured to obtain the term vector of each word in the set of words corresponding to the first category;
a first computing unit, configured to calculate the average term vector of the obtained term vectors and use the average term vector as the term vector of the first category;
a second computing unit, configured to calculate the distance between the term vector of each word and the term vector of the first category, and use the distance between the term vector of each word and the term vector of the first category as the first degree of membership between that word and the first category.
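The first and second computing units above can be sketched minimally as follows. Euclidean distance is assumed here; the embodiment leaves the distance metric open, and the names are illustrative.

```python
import math

def category_vector(term_vectors):
    # First computing unit: average the term vectors of the words in the
    # set of words corresponding to the category.
    dim = len(term_vectors[0])
    return [sum(vec[i] for vec in term_vectors) / len(term_vectors)
            for i in range(dim)]

def first_membership(word_vec, cat_vec):
    # Second computing unit: the distance between a word's term vector and
    # the category's term vector is used directly as the first degree of
    # membership between the word and the category.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(word_vec, cat_vec)))
```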
Further, the apparatus also includes:
a second acquisition module, configured to obtain multiple text samples;
a word segmentation module, configured to segment each of the multiple text samples into words and form the obtained words into a training set;
a clustering module, configured to cluster the words in the training set to obtain multiple sets of words and the category of each of the multiple sets of words.
Further, the clustering module includes:
a second acquisition unit, configured to obtain the term vector of each word in the training set;
a third computing unit, configured to calculate, according to the term vector of each word, the distance between any two words;
a clustering unit, configured to form multiple words whose distance is less than a preset distance into one set of words;
a third acquisition unit, configured to obtain the user-annotated category of each set of words.
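The clustering unit's behavior can be sketched as a greedy single-link grouping. The embodiment only requires that words closer than the preset distance end up in one set of words; the exact algorithm and the names below are illustrative assumptions.

```python
import math

def _distance(a, b):
    # Euclidean distance between two term vectors (one possible metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_words(term_vectors, preset_distance):
    """term_vectors: word -> term vector. Returns a list of sets of words,
    where a word joins a set if it is closer than preset_distance to any
    member of that set."""
    clusters = []
    for word, vec in term_vectors.items():
        for cluster in clusters:
            if any(_distance(vec, term_vectors[w]) < preset_distance
                   for w in cluster):
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters
```

The user-annotated category of each resulting set of words would then be collected separately, as the third acquisition unit describes.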
Further, the second computing module 303 includes:
a fourth computing unit, configured to calculate, for each word, the product of its word frequency, weight, inverse document frequency and the first degree of membership between the word and the first category, obtaining the third degree of membership between each word and the first category;
a summing unit, configured to add up the third degrees of membership between the words and the first category to obtain the second degree of membership between the text and the first category.
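The fourth computing unit and the summing unit together amount to a weighted sum, sketched below. The tuple layout and names are illustrative, not prescribed by the embodiment.

```python
def second_membership(word_stats, first_memberships):
    """word_stats: word -> (word frequency, weight, inverse document
    frequency); first_memberships: word -> first degree of membership
    between that word and the first category. Returns the second degree
    of membership between the text and the first category."""
    total = 0.0
    for word, (tf, weight, idf) in word_stats.items():
        # Third degree of membership for this word, accumulated into the
        # second degree of membership for the text.
        total += tf * weight * idf * first_memberships[word]
    return total
```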
In the embodiments of the present invention, the second degree of membership between the text and a first category is calculated according to the term vector, word frequency, weight and inverse document frequency of each word contained in the text to be classified and the term vector of the first category, the first category being any category in the category set; a category is then selected from the category set according to the second degrees of membership between the text and the categories. Because each word contained in the text is taken into account when classifying the text to be classified, the accuracy of classification is improved.
Embodiment 4
Fig. 4 is a schematic structural diagram of a server provided by an embodiment of the present invention. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (such as one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and the like.
The server 1900 may include a memory and one or more programs, the one or more programs being stored in the memory and configured to be executed by the one or more processors; the one or more programs include instructions for performing the following operations:
obtaining the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified;
calculating, according to the term vector of each word and the term vector of a first category, the first degree of membership between each word and the first category, the first category being any category in a category set;
calculating the second degree of membership between the text and the first category according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word;
selecting from the category set a category whose second degree of membership with the text meets a preset condition, and determining the selected category as the category of the text.
Further, the calculating, according to the term vector of each word and the term vector of the first category, the first degree of membership between each word and the first category includes:
obtaining the term vector of each word in the set of words corresponding to the first category;
calculating the average term vector of the obtained term vectors and using the average term vector as the term vector of the first category;
calculating the distance between the term vector of each word and the term vector of the first category, and using the distance between the term vector of each word and the term vector of the first category as the first degree of membership between that word and the first category.
Further, the method also includes:
obtaining multiple text samples;
segmenting each of the multiple text samples into words, and forming the obtained words into a training set;
clustering the words in the training set to obtain multiple sets of words and the category of each of the multiple sets of words.
Further, the clustering the words in the training set to obtain multiple sets of words and the category of each of the multiple sets of words includes:
obtaining the term vector of each word in the training set;
calculating, according to the term vector of each word, the distance between any two words;
forming multiple words whose distance is less than a preset distance into one set of words, and obtaining the user-annotated category of the set of words.
Further, the calculating the second degree of membership between the text and the first category according to the first degree of membership between each word and the first category and the word frequency, weight and inverse document frequency of each word includes:
calculating, for each word, the product of its word frequency, weight, inverse document frequency and the first degree of membership between the word and the first category, obtaining the third degree of membership between each word and the first category;
adding up the third degrees of membership between the words and the first category to obtain the second degree of membership between the text and the first category.
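Taken together, the instructions above describe one full classification pass, which can be sketched end to end as follows. This is one illustrative reading: because the first degree of membership is defined as a distance, a smaller summed score is treated here as a closer fit, so the preset condition is taken as the minimum score; all names are assumptions, not part of the invention.

```python
import math

def classify(text_words, categories):
    """text_words: word -> (term vector, word frequency, weight, inverse
    document frequency) for the text to be classified.
    categories: category -> list of term vectors of its set of words."""
    scores = {}
    for cat, member_vecs in categories.items():
        # Term vector of the category: average of its set of words' vectors.
        dim = len(member_vecs[0])
        centroid = [sum(v[i] for v in member_vecs) / len(member_vecs)
                    for i in range(dim)]
        total = 0.0
        for vec, tf, weight, idf in text_words.values():
            # First degree of membership: distance to the category vector.
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, centroid)))
            # Third degree of membership, accumulated into the second.
            total += tf * weight * idf * dist
        scores[cat] = total
    # Distance-based membership: smaller score means a closer category.
    return min(scores, key=scores.get)
```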
In the embodiments of the present invention, the second degree of membership between the text and a first category is calculated according to the term vector, word frequency, weight and inverse document frequency of each word contained in the text to be classified and the term vector of the first category, the first category being any category in the category set; a category is then selected from the category set according to the second degrees of membership between the text and the categories. Because each word contained in the text is taken into account when classifying the text to be classified, the accuracy of classification is improved.
It should be noted that when the apparatus for text classification provided by the above embodiments classifies a text, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments of text classification provided above belong to the same conception; for their specific implementation, refer to the method embodiments, which are not repeated here.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (6)

  1. A method of text classification, characterized in that the method comprises:
    obtaining the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified, wherein a term vector refers to a word-vector description constructed based on word embedding technology;
    obtaining the term vector of each word in the set of words corresponding to a first category, calculating the average term vector of the obtained term vectors and using the average term vector as the term vector of the first category, calculating the distance between the term vector of each word and the term vector of the first category, and using the distance between the term vector of each word and the term vector of the first category as the first degree of membership between that word and the first category, the first category being any category in a category set;
    calculating, for each word, the product of its word frequency, weight, inverse document frequency and the first degree of membership between the word and the first category to obtain the third degree of membership between each word and the first category, and adding up the third degrees of membership between the words and the first category to obtain the second degree of membership between the text and the first category;
    selecting from the category set a category whose second degree of membership with the text meets a preset condition, and determining the selected category as the category of the text.
  2. The method according to claim 1, characterized in that the method further comprises:
    obtaining multiple text samples;
    segmenting each of the multiple text samples into words, and forming the obtained words into a training set;
    clustering the words in the training set to obtain multiple sets of words and the category of each of the multiple sets of words.
  3. The method according to claim 2, characterized in that the clustering the words in the training set to obtain multiple sets of words and the category of each of the multiple sets of words comprises:
    obtaining the term vector of each word in the training set;
    calculating, according to the term vector of each word, the distance between any two words;
    forming multiple words whose distance is less than a preset distance into one set of words, and obtaining the category of the set of words, the category of the set of words being user-annotated.
  4. An apparatus for text classification, characterized in that the apparatus comprises:
    a first acquisition module, configured to obtain the term vector, word frequency, weight and inverse document frequency of each word contained in a text to be classified, wherein a term vector refers to a word-vector description constructed based on word embedding technology;
    a first computing module, configured to obtain the term vector of each word in the set of words corresponding to a first category, calculate the average term vector of the obtained term vectors and use the average term vector as the term vector of the first category, calculate the distance between the term vector of each word and the term vector of the first category, and use the distance between the term vector of each word and the term vector of the first category as the first degree of membership between that word and the first category, the first category being any category in a category set;
    a second computing module, configured to calculate, for each word, the product of its word frequency, weight, inverse document frequency and the first degree of membership between the word and the first category to obtain the third degree of membership between each word and the first category, and add up the third degrees of membership between the words and the first category to obtain the second degree of membership between the text and the first category;
    a classification module, configured to select from the category set a category whose second degree of membership with the text meets a preset condition, and determine the selected category as the category of the text.
  5. The apparatus according to claim 4, characterized in that the apparatus further comprises:
    a second acquisition module, configured to obtain multiple text samples;
    a word segmentation module, configured to segment each of the multiple text samples into words and form the obtained words into a training set;
    a clustering module, configured to cluster the words in the training set to obtain multiple sets of words and the category of each of the multiple sets of words.
  6. The apparatus according to claim 5, characterized in that the clustering module comprises:
    a second acquisition unit, configured to obtain the term vector of each word in the training set;
    a third computing unit, configured to calculate, according to the term vector of each word, the distance between any two words;
    a clustering unit, configured to form multiple words whose distance is less than a preset distance into one set of words;
    a third acquisition unit, configured to obtain the category of the set of words, the category of the set of words being user-annotated.
CN201510364152.4A 2015-06-26 2015-06-26 A kind of method and apparatus of text classification Active CN105005589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510364152.4A CN105005589B (en) 2015-06-26 2015-06-26 A kind of method and apparatus of text classification


Publications (2)

Publication Number Publication Date
CN105005589A CN105005589A (en) 2015-10-28
CN105005589B true CN105005589B (en) 2017-12-29

Family

ID=54378265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510364152.4A Active CN105005589B (en) 2015-06-26 2015-06-26 A kind of method and apparatus of text classification

Country Status (1)

Country Link
CN (1) CN105005589B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160058587A (en) * 2014-11-17 2016-05-25 삼성전자주식회사 Display apparatus and method for summarizing of document
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus
CN106874295A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 A kind of method and device for determining service parameter
CN105893444A (en) * 2015-12-15 2016-08-24 乐视网信息技术(北京)股份有限公司 Sentiment classification method and apparatus
CN105631009A (en) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 Retrieval method and system based on word vector similarity
CN107229636B (en) * 2016-03-24 2021-08-13 腾讯科技(深圳)有限公司 Word classification method and device
CN106021578B (en) * 2016-06-01 2019-07-23 南京邮电大学 A kind of modified text classification algorithm based on cluster and degree of membership fusion
CN106469192B (en) * 2016-08-30 2021-07-30 北京奇艺世纪科技有限公司 Text relevance determining method and device
CN106503153B (en) * 2016-10-21 2019-05-10 江苏理工学院 Computer text classification system
CN108062954B (en) * 2016-11-08 2020-12-08 科大讯飞股份有限公司 Speech recognition method and device
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN106611054A (en) * 2016-12-26 2017-05-03 电子科技大学 Method for extracting enterprise behavior or event from massive texts
CN108628875B (en) * 2017-03-17 2022-08-30 腾讯科技(北京)有限公司 Text label extraction method and device and server
CN107766426B (en) * 2017-09-14 2020-05-22 北京百分点信息科技有限公司 Text classification method and device and electronic equipment
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN108363716B (en) * 2017-12-28 2020-04-24 广州索答信息科技有限公司 Domain information classification model generation method, classification method, device and storage medium
CN108415903B (en) * 2018-03-12 2021-09-07 武汉斗鱼网络科技有限公司 Evaluation method, storage medium, and apparatus for judging validity of search intention recognition
CN108959453B (en) * 2018-06-14 2021-08-27 中南民族大学 Information extraction method and device based on text clustering and readable storage medium
CN109388693B (en) * 2018-09-13 2021-04-27 武汉斗鱼网络科技有限公司 Method for determining partition intention and related equipment
CN110968690B (en) * 2018-09-30 2023-05-23 百度在线网络技术(北京)有限公司 Clustering division method and device for words, equipment and storage medium
CN109740152B (en) * 2018-12-25 2023-02-17 腾讯科技(深圳)有限公司 Text category determination method and device, storage medium and computer equipment
CN110083828A (en) * 2019-03-29 2019-08-02 珠海远光移动互联科技有限公司 A kind of Text Clustering Method and device
CN112149414B (en) * 2020-09-23 2023-06-23 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794311A (en) * 2010-03-05 2010-08-04 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages
CN104679860A (en) * 2015-02-27 2015-06-03 北京航空航天大学 Classifying method for unbalanced data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141977A (en) * 2010-02-01 2011-08-03 阿里巴巴集团控股有限公司 Text classification method and device
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts


Also Published As

Publication number Publication date
CN105005589A (en) 2015-10-28


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190802

Address after: Room 403, East Block 2, SEG Science Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518000

Co-patentee after: Tencent Cloud Computing (Beijing) Co., Ltd.

Patentee after: Tencent Technology (Shenzhen) Co., Ltd.

Address before: Room 403, East Block 2, SEG Science Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518000

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.