CN106294689A - Method and apparatus for dimensionality reduction based on text category feature selection - Google Patents

Method and apparatus for dimensionality reduction based on text category feature selection

Info

Publication number
CN106294689A
CN106294689A (application CN201610639904.8A)
Authority
CN
China
Prior art keywords
text
feature
term
dimensionality reduction
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610639904.8A
Other languages
Chinese (zh)
Other versions
CN106294689B (en)
Inventor
张达
亓开元
苏志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201610639904.8A
Publication of CN106294689A
Application granted
Publication of CN106294689B
Legal status: Active (granted)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and apparatus for dimensionality reduction based on text category feature selection. The method comprises the steps of: obtaining the text to be processed; segmenting it with HanLP into multiple terms and removing the stop words among those terms; computing term frequency, term document frequency, and document word count statistics; storing the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector; performing an information gain calculation on the primary text vector, sorting the terms by the size of their information gain, and forming the terms that meet a preset requirement into a reference vector for feature selection; and reducing the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector. The apparatus comprises an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module. The method and apparatus perform text feature selection based on an information gain algorithm and reduce the dimensionality of the feature-word-set vector, lessening the computational burden caused by excessive dimensionality.

Description

Method and apparatus for dimensionality reduction based on text category feature selection
Technical field
The present invention relates to the field of machine learning, and in particular to a method and apparatus for dimensionality reduction based on text category feature selection.
Background technology
With the rapid development of the Internet and the continuous innovation of Internet-related technology, the cost and efficiency of informatization across society have changed enormously compared with ten or twenty years ago. Moreover, the growing popularity of the Internet has produced data in many different forms (text, multimedia, and so on) from many different sources. Faced with this flood of information, people can no longer process all information resources manually; they need auxiliary tools to help them find, filter, and manage these electronic information resources.
Traditional text-processing software was built to process a single kind of text. With the emergence of multiple text formats, however, the files that carry electronic information are no longer limited to a single file type. Especially with the development of the Internet, each of these formats shows its own advantages, and the limitations of processing systems designed for a single file format have become increasingly obvious.
The representation of a text is abstracted as a space vector over a feature word set, but the original candidate feature word set can run to hundreds of thousands of dimensions, and such a high-dimensional text representation imposes a heavy computational burden.
Summary of the invention
The present invention provides a method and apparatus for dimensionality reduction based on text category feature selection to solve the above technical problem.
The method for dimensionality reduction based on text category feature selection provided by the present invention includes the steps:
Step A: obtain the details of the data source text to be processed and store them;
Step B: segment the data source text with HanLP to obtain multiple terms, and remove the stop words among those terms;
Step C: compute term frequency, term document frequency, and document word count statistics;
Step D: store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
Step E: perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
Step F: reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
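To make the flow of steps A through F concrete, a minimal end-to-end sketch in Python follows. It is an illustration rather than the patent's implementation: the helper names (segment, term_statistics, information_gain, build_reference_vector) are assumptions, and fuller sketches of the individual steps appear in the detailed description below.

```python
# Minimal end-to-end sketch of steps A-F. The helpers used here
# (segment, term_statistics, information_gain, build_reference_vector)
# are illustrative assumptions, sketched in the detailed description.

def reduce_dimensionality(texts, top_n):
    # Step B: segment each text into terms and drop stop words
    docs = [segment(text) for text in texts]
    # Steps C-D: term frequency, document frequency, per-document counts
    word_freq, doc_freq, per_doc = term_statistics(docs)
    # Step E: score terms by information gain, keep the top_n as the
    # reference vector for feature selection
    scores = information_gain(word_freq, doc_freq, per_doc, len(docs))
    reference = build_reference_vector(scores, top_n)
    # Step F: represent every text only by the reference-vector terms
    return [[counts.get(term, 0) for term in reference]
            for counts, _length in per_doc]
```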
In step E, the information gain calculation includes the steps:
Treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T.
In step E, $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
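As a hedged illustration of the formula, the sketch below computes IG(T) from the probabilities defined above; it evaluates the equivalent form IG(T) = H(C) - P(t)H(C|t) - P(t̄)H(C|t̄), with the usual convention that 0·log₂0 = 0. The function name and array layout are assumptions for the example.

```python
import numpy as np

def information_gain_from_probs(p_c, p_t, p_c_given_t, p_c_given_not_t):
    """IG(T) per the formula above, via IG = H(C) - P(t)H(C|t) - P(not t)H(C|not t).
    p_c, p_c_given_t, p_c_given_not_t are length-n arrays over the classes;
    p_t is the scalar P(t)."""
    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # 0 * log2(0) is taken as 0
        return -np.sum(p * np.log2(p))
    return (entropy(p_c)
            - p_t * entropy(p_c_given_t)
            - (1.0 - p_t) * entropy(p_c_given_not_t))
```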
An embodiment of the present invention also provides an apparatus for dimensionality reduction based on text category feature selection, comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module;
the acquisition module is configured to obtain the details of the data source text to be processed and store them;
the word segmentation module is configured to segment the data source text with HanLP to obtain multiple terms and remove the stop words among those terms;
the statistics module is configured to compute term frequency (the number of occurrences of each term), term document frequency, and document word count statistics;
the vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
the information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
the dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
The information gain calculation module is configured to:
treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T;
$P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
The embodiments of the present invention provide a method and apparatus for dimensionality reduction based on text category feature selection: segmenting with HanLP, removing stop words, computing the information gain of each term treated as a feature, sorting by information gain to obtain a reference vector, and then reducing the dimensionality of documents according to that reference vector. This realizes a document-feature dimensionality reduction method based on an information gain algorithm, reduces the dimensionality of the document feature word set, and lessens the computational burden caused by a feature word set of hundreds of thousands of dimensions.
Brief description of the drawings
Fig. 1 is a flow diagram of one embodiment of the method for dimensionality reduction based on text category feature selection of the present invention;
Fig. 2 is a flow diagram of the text feature selection process provided by embodiment two of the present invention.
Detailed description of the invention
The embodiments of the present invention provide a method and apparatus for dimensionality reduction based on text category feature selection. It is a text feature selection algorithm based on information gain (Information Gain, IG), which extracts the most representative and effective features from the text so as to reduce the dimensionality of the data set. In information gain, the criterion of importance is how much information a feature brings to the classification system: the more information it brings, the more important the feature.
The embodiment of the present invention uses HanLP to segment the text. Its principle is to build a dictionary large enough to contain all the Chinese words likely to occur and to check whether each character string in the Chinese text to be processed appears in that dictionary; once a match is found, a word is identified and split off from the character string, and this continues until the string is fully segmented. HanLP is full-featured, efficient, cleanly architected, trained on up-to-date corpora, and customizable. While providing rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, provides services statically, and publishes its dictionaries in plain text, which makes it very convenient to use; it also ships with corpus-processing tools that help users train on their own corpora. Its biggest drawback is that segmentation accuracy depends entirely on the dictionary, which must therefore be kept up to date.
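A minimal segmentation sketch follows, assuming the pyhanlp binding of HanLP is installed; the tiny stop-word set is an illustrative stand-in for a real stop-word dictionary.

```python
# Segmentation with stop-word removal (step B), assuming the pyhanlp
# binding of HanLP. The stop-word set below is a stand-in; a real
# deployment would load a full stop-word dictionary.
from pyhanlp import HanLP

STOP_WORDS = {"的", "了", "是", "在", "和"}  # illustrative stop words

def segment(text):
    """Segment a Chinese text into terms and drop stop words."""
    return [term.word for term in HanLP.segment(text)
            if term.word.strip() and term.word not in STOP_WORDS]
```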
The information gain method judges how much information the presence or absence of a feature in a text provides about the text's class. For filtering problems, it measures how much the occurrence of a known feature in texts related to a topic contributes to predicting that topic. Computing information gain identifies features that occur frequently in positive samples and rarely in negative samples. Information gain involves a fair amount of mathematical theory and entropy formulas; the embodiment of the present invention defines it as the amount of information a feature item can provide to the whole classification, that is, the difference between the entropy when no feature is considered and the entropy after this feature is considered. From the training data, the embodiment computes the information gain of each feature item, deletes the items whose information gain is too small, and sorts the remainder by information gain in descending order for screening.
Embodiment one
Specifically, referring to Fig. 1, the method includes the steps:
Step S110: obtain the details of the data source text to be processed and store them.
The details of the data source text are obtained and stored in HDFS, with a backup retained for subsequent inspection or data review.
Step S111: segment the data source text with HanLP to obtain multiple terms, and remove the stop words among those terms.
The effective information of a text is mainly carried by content words such as nouns, adjectives, verbs, and measure words, and the class a text belongs to is also mainly distinguished by these content words; meanwhile there are function words that occur frequently in all texts, carry no real meaning, and contribute almost nothing to text classification. These stop words have little practical meaning but appear very often in texts. If they are not removed, two texts with completely different content may become hard to distinguish because of this large amount of shared information; they would also affect the subsequent feature selection stage, increase the system's computational overhead, and ultimately affect the construction of the classifier. Therefore, after the text has been segmented, words present in a stop-word dictionary are filtered out directly.
Step S112: compute term frequency, term document frequency, and document word count statistics.
HanLP is used to segment the text, and the term frequency, term (Term) document frequency, and document word count are computed. Here, term frequency is the number of times each term occurs in the whole text collection, term document frequency is the number of times a term occurs within one document, and document word count is the number of terms contained in one document.
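A sketch of these statistics for already-segmented documents follows, using only the standard library; the (counts, length) tuple layout is an assumption carried through the other sketches.

```python
from collections import Counter

def term_statistics(docs):
    """Step C-D statistics over segmented documents.
    Returns corpus-wide term frequency, document frequency, and for
    every document its term counts plus its word count."""
    word_freq = Counter()   # term -> occurrences across the whole corpus
    doc_freq = Counter()    # term -> number of documents containing it
    per_doc = []            # per document: (Counter of terms, word count)
    for doc in docs:
        counts = Counter(doc)
        word_freq.update(counts)
        doc_freq.update(counts.keys())
        per_doc.append((counts, len(doc)))
    return word_freq, doc_freq, per_doc
```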
Step S113: store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector.
The number of occurrences of each term (term frequency) and the term document frequency are stored in an in-memory database, forming a vectorized text for the information gain calculation to read and write.
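A sketch of persisting these statistics to an in-memory store follows, assuming a locally running redis instance and the redis-py client; the key names ("tf", "df", "doclen") are illustrative choices, not specified by the patent.

```python
import redis

def store_statistics(word_freq, doc_freq, per_doc):
    """Persist the step C-D statistics to redis so the information gain
    calculation can read them back. Key names are illustrative."""
    r = redis.Redis()
    for term, tf in word_freq.items():
        r.hset("tf", term, tf)            # corpus-wide term frequency
    for term, df in doc_freq.items():
        r.hset("df", term, df)            # document frequency
    for i, (_counts, length) in enumerate(per_doc):
        r.hset("doclen", str(i), length)  # word count of document i
```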
Step S114: perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection.
An information gain calculation is performed on the text vectors, and the terms are sorted by information gain from largest to smallest; the top N terms (as required) are retained as the reference vector for feature selection. All texts are then reduced in dimensionality according to this reference vector, forming the final dimension-reduced text vectors.
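The ranking and projection can be sketched as below; build_reference_vector keeps the top N terms by information gain, and project represents one document only by those terms. Both names are assumptions for the example.

```python
def build_reference_vector(ig_scores, top_n):
    """Keep the top_n terms with the largest information gain."""
    ranked = sorted(ig_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _score in ranked[:top_n]]

def project(doc_counts, reference):
    """Dimension-reduce one document onto the reference vector (step F)."""
    return [doc_counts.get(term, 0) for term in reference]
```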
Mathematical definition of entropy: suppose a variable X can take n values $x_1, x_2, \ldots, x_n$, with probabilities $P_1, P_2, \ldots, P_n$ respectively. The entropy of X is defined as:
$$H(X) = -\sum_{i=1}^{n} P_i \log_2 P_i$$
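As a worked check of the definition, a fair coin (two outcomes with probability 1/2) has exactly one bit of entropy; the sketch below encodes the formula directly, with 0·log₂0 taken as 0.

```python
import math

def entropy(probs):
    """H(X) = -sum_i P_i * log2(P_i), with 0 * log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

assert abs(entropy([0.5, 0.5]) - 1.0) < 1e-12   # a fair coin: 1 bit
assert entropy([1.0]) == 0.0                    # a certain outcome: 0 bits
```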
Entropy of a classification system: for a classification system, the class C is a variable whose possible values are $C_1, C_2, \ldots, C_n$, with occurrence probabilities $P(C_1), P(C_2), \ldots, P(C_n)$, where n is the number of classes. The entropy of the classification system is defined as:
$$H(C) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i)$$
where $P(C_i)$ is the probability that class $C_i$ occurs, which can be estimated as the number of records (documents) in class $C_i$ divided by the total number of records (documents). That is:
$$P(C_i) = \hat{P}(C_i) = \frac{N_{C_i}}{N}$$
where N is the total number of records and $N_{C_i}$ is the number of records in class $C_i$.
Conditional entropy: suppose feature X can take n values $x_1, x_2, \ldots, x_n$. Then, given X, the entropy of the system is defined as:
$$H(C \mid X) = \sum_{i=1}^{n} P(X = x_i) \times H(C \mid X = x_i)$$
where
$$H(C \mid X = x_i) = -\sum_{j=1}^{n} P(C_j \mid X = x_i) \log_2 P(C_j \mid X = x_i)$$
Information gain is defined per feature: for a feature T, it is the difference between the amount of information the system has with the feature and without it, and that difference is exactly the amount of information the feature brings to the system, its gain.
The information gain that feature T brings to the system can be written as the difference between the original system entropy and the conditional entropy after fixing feature T:
$$IG(T) = H(C) - H(C \mid T)$$
In a text classification system, a feature T corresponds to a term, and it takes only two values, "occurs" or "does not occur". We write t for the event that feature T occurs and $\bar{t}$ for the event that it does not.
So:
$$H(C \mid T) = P(t) \times H(C \mid t) + P(\bar{t}) \times H(C \mid \bar{t})$$
where $P(t)$ is the probability that feature T occurs and $P(\bar{t})$ the probability that it does not.
Expanding this formula further:
$$H(C \mid t) = -\sum_{i=1}^{n} P(C_i \mid t) \log_2 P(C_i \mid t)$$
$$H(C \mid \bar{t}) = -\sum_{i=1}^{n} P(C_i \mid \bar{t}) \log_2 P(C_i \mid \bar{t})$$
So IG(T) can be expanded into:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
The feature selection of text extracts the important terms from the whole text collection and involves no notion of class, so the problem must be generalized by treating every text as its own class. The number of classes then equals the number of texts in the collection, N. The parameters of the information gain formula are estimated on this basis.
Notation:
N: the total number of texts, i.e., the total number of classes;
$P(C_i)$: the probability that class $C_i$ occurs, i.e., the probability that text $D_i$ occurs, equal to $\frac{1}{N}$;
$P(t)$: the probability that feature T occurs, estimated as the number of texts containing feature T divided by the total number of texts N, i.e., $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t})$: the probability that feature T does not occur, equal to $1 - P(t)$;
$P(C_i \mid t)$: the probability that a text belongs to class $C_i$ given that it contains feature T. Two estimates are possible:
using the number of texts that contain feature T and belong to class $C_i$ divided by the total number of texts, which takes the value 0 or $\frac{1}{N}$;
or expanding by Bayes' formula, $P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, where $P(t \mid C_i)$ is the probability that feature T occurs in class $C_i$, i.e., in document $D_i$, estimated as $\frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in the document and $TF_i$ is the total number of term occurrences in the document.
$P(C_i \mid \bar{t})$: the probability that a text does not contain feature T and belongs to class $C_i$. Two estimates are possible:
using the number of texts that do not contain feature T and belong to class $C_i$ divided by the total number of texts, which takes the value 0 or $\frac{1}{N}$;
or expanding by Bayes' formula, $P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$, where $P(\bar{t} \mid C_i) = 1 - P(t \mid C_i)$.
It should be noted that:
when estimating P(t), the value may reach 1, which would make $P(\bar{t})$ equal to 0 so that $P(C_i \mid \bar{t})$ cannot be computed; in practice a smoothed estimate of $P(t)$, kept strictly below 1, is therefore used;
$P(t \mid C_i)$ is estimated with $\frac{TF_T}{TF_i}$; if $TF_T$ is 0 this estimate becomes 0, so in practice a smoothed estimate, kept strictly positive, is used.
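Putting the estimates together, the sketch below computes the information gain of every term under the text-as-class treatment ($P(C_i) = 1/N$, $P(t) = DF_T/N$, $P(t \mid C_i) = TF_T/TF_i$). Since the exact smoothing formula is not fixed above, the eps guard here is an illustrative assumption.

```python
import math

def information_gain(word_freq, doc_freq, per_doc, n_docs, eps=1e-9):
    """Score every term by IG under the every-text-is-a-class treatment.
    word_freq, doc_freq, per_doc are as returned by term_statistics();
    eps is an illustrative smoothing guard (the text notes smoothing is
    needed but does not fix the formula)."""
    p_c = 1.0 / n_docs
    h_c = math.log2(n_docs)       # H(C) = -sum (1/N) log2(1/N) = log2(N)
    scores = {}
    for term, df in doc_freq.items():
        p_t = min(df / n_docs, 1.0 - eps)   # keep P(t) strictly below 1
        h_c_t = 0.0                         # accumulates H(C | t)
        h_c_not_t = 0.0                     # accumulates H(C | not t)
        for counts, length in per_doc:
            p_t_ci = max(counts.get(term, 0) / max(length, 1), eps)
            p_ci_t = p_t_ci * p_c / p_t                    # Bayes
            p_ci_not_t = (1.0 - p_t_ci) * p_c / (1.0 - p_t)
            if p_ci_t > 0:
                h_c_t -= p_ci_t * math.log2(p_ci_t)
            if p_ci_not_t > 0:
                h_c_not_t -= p_ci_not_t * math.log2(p_ci_not_t)
        scores[term] = h_c - p_t * h_c_t - (1.0 - p_t) * h_c_not_t
    return scores
```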
The features described in the embodiments of the present invention refer to the terms of the text.
Those skilled in the art can determine the definition of each parameter from the technical solutions of the embodiments of the present invention; the embodiments do not enumerate them all.
Step S115: reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
Embodiment one of the present invention performs feature selection on text based on an information gain algorithm: features are ranked and screened by their importance to the whole system, thereby achieving dimensionality reduction and lightening the computational burden.
Embodiment two
In embodiment two of the present invention, the main flow of the method for dimensionality reduction based on text category feature selection is the same as in embodiment one; the text feature selection flow, shown in Fig. 2, includes the steps:
Step S210: obtain the original text.
Step S211: obtain a segmenter and use it to segment the original text.
Step S212: obtain a noun filter and use it to screen the segmented text for nouns, obtaining a noun set.
Step S213: compute document frequency statistics and store them in redis.
Step S214: compute term frequency statistics and store them in redis.
Step S215: build a forward index of the documents.
Step S216: perform the IG calculation from the statistics of steps S213 and S214.
Step S217: persist the resulting feature words.
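A condensed sketch of this flow (S210 to S217) follows, again assuming pyhanlp and redis-py; the noun screening relies on HanLP part-of-speech tags beginning with "n", and the redis key names are illustrative. term_statistics, information_gain, and build_reference_vector are the sketches from embodiment one.

```python
import redis
from pyhanlp import HanLP

def select_features(texts, top_n):
    """Embodiment two flow: segment, keep nouns, count into redis,
    score by information gain, persist the chosen feature words."""
    r = redis.Redis()
    docs = []
    for i, text in enumerate(texts):
        # S211-S212: segment, then keep nouns (HanLP tags starting "n")
        terms = [t.word for t in HanLP.segment(text)
                 if str(t.nature).startswith("n")]
        docs.append(terms)
        if terms:
            r.rpush(f"forward:{i}", *terms)        # S215: forward index
        for term in set(terms):
            r.hincrby("df", term, 1)               # S213: document frequency
        for term in terms:
            r.hincrby("tf", term, 1)               # S214: term frequency
    word_freq, doc_freq, per_doc = term_statistics(docs)            # S216
    scores = information_gain(word_freq, doc_freq, per_doc, len(docs))
    features = build_reference_vector(scores, top_n)
    if features:
        r.rpush("features", *features)             # S217: persist features
    return features
```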
Embodiment three
Embodiment three of the present invention provides an apparatus for dimensionality reduction based on text category feature selection, comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module.
The acquisition module is configured to obtain the details of the data source text to be processed and store them.
The word segmentation module is configured to segment the data source text with HanLP to obtain multiple terms and remove the stop words among those terms.
The statistics module is configured to compute term frequency (the number of occurrences of each term), term document frequency, and document word count statistics.
The vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector.
The information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection.
The dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
The representation of a text is abstracted as a space vector over a feature word set, but the original candidate feature word set can run to hundreds of thousands of dimensions. On the one hand, such a high-dimensional text representation imposes a computational burden; on the other hand, heavy feature redundancy degrades classification performance. The embodiments of the present invention provide a method and apparatus for feature extraction based on an information gain algorithm that reduce the dimensionality of the feature word set, lighten the corresponding computational burden, eliminate redundant features, and improve classification performance.
It should be noted that the apparatus or system embodiments of the present invention may be implemented in software, in hardware, or in a combination of the two. At the hardware level, besides the CPU, memory, network interface, and non-volatile storage, the device hosting the apparatus in an embodiment can usually also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as a logical apparatus, it is formed by the CPU of its host device reading the corresponding computer program instructions from non-volatile storage into memory and running them.
The foregoing are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A method for dimensionality reduction based on text category feature selection, characterized in that it comprises the steps:
Step A: obtain the details of the data source text to be processed and store them;
Step B: segment the data source text to obtain multiple terms, and remove the stop words among those terms;
Step C: compute term frequency, term document frequency, and document word count statistics;
Step D: store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
Step E: perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
Step F: reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
2. The method for dimensionality reduction based on text category feature selection according to claim 1, characterized in that the information gain calculation in step E comprises the steps:
treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T.
3. The method for dimensionality reduction based on text category feature selection according to claim 2, characterized in that, in step E, $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
4. An apparatus for dimensionality reduction based on text category feature selection, characterized in that it comprises an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module;
the acquisition module is configured to obtain the details of the data source text to be processed and store them;
the word segmentation module is configured to segment the data source text with HanLP to obtain multiple terms and remove the stop words among those terms;
the statistics module is configured to compute term frequency, term document frequency, and document word count statistics;
the vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
the information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
the dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
5. The apparatus for dimensionality reduction based on text category feature selection according to claim 4, characterized in that
the information gain calculation module is configured to:
treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T;
$P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
CN201610639904.8A 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection Active CN106294689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610639904.8A CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610639904.8A CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection

Publications (2)

Publication Number Publication Date
CN106294689A true CN106294689A (en) 2017-01-04
CN106294689B CN106294689B (en) 2018-09-25

Family

ID=57665827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610639904.8A Active CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection

Country Status (1)

Country Link
CN (1) CN106294689B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033949A (en) * 2010-12-23 2011-04-27 南京财经大学 Correction-based K nearest neighbor text classification method
CN102662952A (en) * 2012-03-02 2012-09-12 成都康赛电子科大信息技术有限责任公司 Chinese text parallel data mining method based on hierarchy
CN105095162A (en) * 2014-05-19 2015-11-25 腾讯科技(深圳)有限公司 Text similarity determining method and device, electronic equipment and system
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308317A (en) * 2018-09-07 2019-02-05 浪潮软件股份有限公司 A kind of hot spot word extracting method of the non-structured text based on cluster
CN110472240A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Text feature and device based on TF-IDF
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics

Also Published As

Publication number Publication date
CN106294689B (en) 2018-09-25

Similar Documents

Publication Publication Date Title
US8484228B2 (en) Extraction and grouping of feature words
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
RU2377645C2 (en) Method and system for classifying display pages using summaries
CN101364239B (en) Method for auto constructing classified catalogue and relevant system
CN106250526A (en) A kind of text class based on content and user behavior recommends method and apparatus
CN104573054A (en) Information pushing method and equipment
CN110263248A (en) A kind of information-pushing method, device, storage medium and server
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
US20140229486A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN105975459A (en) Lexical item weight labeling method and device
CN103838798A (en) Page classification system and method
CN106096609A (en) A kind of merchandise query keyword automatic generation method based on OCR
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
Shi et al. Mining chinese reviews
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
CN106294689A (en) A kind of method and apparatus selecting based on text category feature to carry out dimensionality reduction
CN111444713B (en) Method and device for extracting entity relationship in news event
CN104881447A (en) Searching method and device
Lin Association rule mining for collaborative recommender systems.
CN109992665A (en) A kind of classification method based on the extension of problem target signature
Campbell et al. Content+ context networks for user classification in twitter
CN107291686B (en) Method and system for identifying emotion identification
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
Bollegala et al. Extracting key phrases to disambiguate personal name queries in web search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant