CN106294689A - Method and apparatus for dimensionality reduction based on text category feature selection - Google Patents

Method and apparatus for dimensionality reduction based on text category feature selection

Info

Publication number
CN106294689A
CN106294689A (application CN201610639904.8A)
Authority
CN
China
Prior art keywords
text
feature
term
dimensionality reduction
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610639904.8A
Other languages
Chinese (zh)
Other versions
CN106294689B (en)
Inventor
张达
亓开元
苏志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201610639904.8A
Publication of CN106294689A
Application granted
Publication of CN106294689B
Legal status: Active (granted)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and apparatus for dimensionality reduction based on text category feature selection. The method comprises the steps of: obtaining the text to be processed; segmenting it with HanLP into multiple terms and removing the stop words among those terms; computing term frequency, term document frequency, and document word count statistics; storing the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector; performing an information gain calculation on the primary text vector, sorting the terms by the size of their information gain, and forming the terms that meet a preset requirement into a reference vector for feature selection; and reducing the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector. The apparatus comprises an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module. The method and apparatus perform text feature selection based on an information gain algorithm and reduce the dimensionality of the feature-word-set vector, lessening the computational burden caused by excessive dimensionality.

Description

Method and apparatus for dimensionality reduction based on text category feature selection
Technical field
The present invention relates to the field of machine learning, and in particular to a method and apparatus for dimensionality reduction based on text category feature selection.
Background technology
With the rapid development of the Internet and the continuous innovation of Internet-related technology, the cost and efficiency of informatization across society have changed enormously compared with ten or twenty years ago. Moreover, the growing popularity of the Internet has produced data in many different forms (text, multimedia, and so on) from many different sources. Faced with this flood of information, people can no longer process all information resources manually; they need auxiliary tools to help them find, filter, and manage these electronic information resources.
Traditional text-processing software was built to process a single kind of text. With the emergence of multiple text formats, however, the files that carry electronic information are no longer limited to a single file type. Especially with the development of the Internet, each of these formats shows its own advantages, and the limitations of processing systems designed for a single file format have become increasingly obvious.
The representation of a text is abstracted as a space vector over a feature word set, but the original candidate feature word set can run to hundreds of thousands of dimensions, and such a high-dimensional text representation imposes a heavy computational burden.
Summary of the invention
The present invention provides a method and apparatus for dimensionality reduction based on text category feature selection to solve the above technical problem.
The method for dimensionality reduction based on text category feature selection provided by the present invention includes the steps:
Step A: obtain the details of the data source text to be processed and store them;
Step B: segment the data source text with HanLP to obtain multiple terms, and remove the stop words among those terms;
Step C: compute term frequency, term document frequency, and document word count statistics;
Step D: store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
Step E: perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
Step F: reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
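To make the flow of steps A through F concrete, a minimal end-to-end sketch in Python follows. It is an illustration rather than the patent's implementation: the helper names (segment, term_statistics, information_gain, build_reference_vector) are assumptions, and fuller sketches of the individual steps appear in the detailed description below.

```python
# Minimal end-to-end sketch of steps A-F. The helpers used here
# (segment, term_statistics, information_gain, build_reference_vector)
# are illustrative assumptions, sketched in the detailed description.

def reduce_dimensionality(texts, top_n):
    # Step B: segment each text into terms and drop stop words
    docs = [segment(text) for text in texts]
    # Steps C-D: term frequency, document frequency, per-document counts
    word_freq, doc_freq, per_doc = term_statistics(docs)
    # Step E: score terms by information gain, keep the top_n as the
    # reference vector for feature selection
    scores = information_gain(word_freq, doc_freq, per_doc, len(docs))
    reference = build_reference_vector(scores, top_n)
    # Step F: represent every text only by the reference-vector terms
    return [[counts.get(term, 0) for term in reference]
            for counts, _length in per_doc]
```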
In step E, the information gain calculation includes the steps:
Treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T.
In step E, $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
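As a hedged illustration of the formula, the sketch below computes IG(T) from the probabilities defined above; it evaluates the equivalent form IG(T) = H(C) - P(t)H(C|t) - P(t̄)H(C|t̄), with the usual convention that 0·log₂0 = 0. The function name and array layout are assumptions for the example.

```python
import numpy as np

def information_gain_from_probs(p_c, p_t, p_c_given_t, p_c_given_not_t):
    """IG(T) per the formula above, via IG = H(C) - P(t)H(C|t) - P(not t)H(C|not t).
    p_c, p_c_given_t, p_c_given_not_t are length-n arrays over the classes;
    p_t is the scalar P(t)."""
    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # 0 * log2(0) is taken as 0
        return -np.sum(p * np.log2(p))
    return (entropy(p_c)
            - p_t * entropy(p_c_given_t)
            - (1.0 - p_t) * entropy(p_c_given_not_t))
```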
An embodiment of the present invention also provides an apparatus for dimensionality reduction based on text category feature selection, comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module;
the acquisition module is configured to obtain the details of the data source text to be processed and store them;
the word segmentation module is configured to segment the data source text with HanLP to obtain multiple terms and remove the stop words among those terms;
the statistics module is configured to compute term frequency (the number of occurrences of each term), term document frequency, and document word count statistics;
the vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
the information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
the dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
The information gain calculation module is configured to:
treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T;
$P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
The embodiments of the present invention provide a method and apparatus for dimensionality reduction based on text category feature selection: segmenting with HanLP, removing stop words, computing the information gain of each term treated as a feature, sorting by information gain to obtain a reference vector, and then reducing the dimensionality of documents according to that reference vector. This realizes a document-feature dimensionality reduction method based on an information gain algorithm, reduces the dimensionality of the document feature word set, and lessens the computational burden caused by a feature word set of hundreds of thousands of dimensions.
Brief description of the drawings
Fig. 1 is a flow diagram of one embodiment of the method for dimensionality reduction based on text category feature selection of the present invention;
Fig. 2 is a flow diagram of the text feature selection process provided by embodiment two of the present invention.
Detailed description of the invention
The embodiments of the present invention provide a method and apparatus for dimensionality reduction based on text category feature selection. It is a text feature selection algorithm based on information gain (Information Gain, IG), which extracts the most representative and effective features from the text so as to reduce the dimensionality of the data set. In information gain, the criterion of importance is how much information a feature brings to the classification system: the more information it brings, the more important the feature.
The embodiment of the present invention uses HanLP to segment the text. Its principle is to build a dictionary large enough to contain all the Chinese words likely to occur and to check whether each character string in the Chinese text to be processed appears in that dictionary; once a match is found, a word is identified and split off from the character string, and this continues until the string is fully segmented. HanLP is full-featured, efficient, cleanly architected, trained on up-to-date corpora, and customizable. While providing rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, provides services statically, and publishes its dictionaries in plain text, which makes it very convenient to use; it also ships with corpus-processing tools that help users train on their own corpora. Its biggest drawback is that segmentation accuracy depends entirely on the dictionary, which must therefore be kept up to date.
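A minimal segmentation sketch follows, assuming the pyhanlp binding of HanLP is installed; the tiny stop-word set is an illustrative stand-in for a real stop-word dictionary.

```python
# Segmentation with stop-word removal (step B), assuming the pyhanlp
# binding of HanLP. The stop-word set below is a stand-in; a real
# deployment would load a full stop-word dictionary.
from pyhanlp import HanLP

STOP_WORDS = {"的", "了", "是", "在", "和"}  # illustrative stop words

def segment(text):
    """Segment a Chinese text into terms and drop stop words."""
    return [term.word for term in HanLP.segment(text)
            if term.word.strip() and term.word not in STOP_WORDS]
```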
The information gain method judges how much information the presence or absence of a feature in a text provides about the text's class. For filtering problems, it measures how much the occurrence of a known feature in texts related to a topic contributes to predicting that topic. Computing information gain identifies features that occur frequently in positive samples and rarely in negative samples. Information gain involves a fair amount of mathematical theory and entropy formulas; the embodiment of the present invention defines it as the amount of information a feature item can provide to the whole classification, that is, the difference between the entropy when no feature is considered and the entropy after this feature is considered. From the training data, the embodiment computes the information gain of each feature item, deletes the items whose information gain is too small, and sorts the remainder by information gain in descending order for screening.
Embodiment one
Specifically, referring to Fig. 1, the method includes the steps:
Step S110: obtain the details of the data source text to be processed and store them.
The details of the data source text are obtained and stored in HDFS, with a backup retained for subsequent inspection or data review.
Step S111: segment the data source text with HanLP to obtain multiple terms, and remove the stop words among those terms.
The effective information of a text is mainly carried by content words such as nouns, adjectives, verbs, and measure words, and the class a text belongs to is also mainly distinguished by these content words; meanwhile there are function words that occur frequently in all texts, carry no real meaning, and contribute almost nothing to text classification. These stop words have little practical meaning but appear very often in texts. If they are not removed, two texts with completely different content may become hard to distinguish because of this large amount of shared information; they would also affect the subsequent feature selection stage, increase the system's computational overhead, and ultimately affect the construction of the classifier. Therefore, after the text has been segmented, words present in a stop-word dictionary are filtered out directly.
Step S112: compute term frequency, term document frequency, and document word count statistics.
HanLP is used to segment the text, and the term frequency, term (Term) document frequency, and document word count are computed. Here, term frequency is the number of times each term occurs in the whole text collection, term document frequency is the number of times a term occurs within one document, and document word count is the number of terms contained in one document.
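A sketch of these statistics for already-segmented documents follows, using only the standard library; the (counts, length) tuple layout is an assumption carried through the other sketches.

```python
from collections import Counter

def term_statistics(docs):
    """Step C-D statistics over segmented documents.
    Returns corpus-wide term frequency, document frequency, and for
    every document its term counts plus its word count."""
    word_freq = Counter()   # term -> occurrences across the whole corpus
    doc_freq = Counter()    # term -> number of documents containing it
    per_doc = []            # per document: (Counter of terms, word count)
    for doc in docs:
        counts = Counter(doc)
        word_freq.update(counts)
        doc_freq.update(counts.keys())
        per_doc.append((counts, len(doc)))
    return word_freq, doc_freq, per_doc
```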
Step S113: store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector.
The number of occurrences of each term (term frequency) and the term document frequency are stored in an in-memory database, forming a vectorized text for the information gain calculation to read and write.
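A sketch of persisting these statistics to an in-memory store follows, assuming a locally running redis instance and the redis-py client; the key names ("tf", "df", "doclen") are illustrative choices, not specified by the patent.

```python
import redis

def store_statistics(word_freq, doc_freq, per_doc):
    """Persist the step C-D statistics to redis so the information gain
    calculation can read them back. Key names are illustrative."""
    r = redis.Redis()
    for term, tf in word_freq.items():
        r.hset("tf", term, tf)            # corpus-wide term frequency
    for term, df in doc_freq.items():
        r.hset("df", term, df)            # document frequency
    for i, (_counts, length) in enumerate(per_doc):
        r.hset("doclen", str(i), length)  # word count of document i
```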
Step S114: perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection.
An information gain calculation is performed on the text vectors, and the terms are sorted by information gain from largest to smallest; the top N terms (as required) are retained as the reference vector for feature selection. All texts are then reduced in dimensionality according to this reference vector, forming the final dimension-reduced text vectors.
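The ranking and projection can be sketched as below; build_reference_vector keeps the top N terms by information gain, and project represents one document only by those terms. Both names are assumptions for the example.

```python
def build_reference_vector(ig_scores, top_n):
    """Keep the top_n terms with the largest information gain."""
    ranked = sorted(ig_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _score in ranked[:top_n]]

def project(doc_counts, reference):
    """Dimension-reduce one document onto the reference vector (step F)."""
    return [doc_counts.get(term, 0) for term in reference]
```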
Mathematical definition of entropy: suppose a variable X can take n values $x_1, x_2, \ldots, x_n$, with probabilities $P_1, P_2, \ldots, P_n$ respectively. The entropy of X is defined as:
$$H(X) = -\sum_{i=1}^{n} P_i \log_2 P_i$$
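As a worked check of the definition, a fair coin (two outcomes with probability 1/2) has exactly one bit of entropy; the sketch below encodes the formula directly, with 0·log₂0 taken as 0.

```python
import math

def entropy(probs):
    """H(X) = -sum_i P_i * log2(P_i), with 0 * log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

assert abs(entropy([0.5, 0.5]) - 1.0) < 1e-12   # a fair coin: 1 bit
assert entropy([1.0]) == 0.0                    # a certain outcome: 0 bits
```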
Entropy of a classification system: for a classification system, the class C is a variable whose possible values are $C_1, C_2, \ldots, C_n$, with occurrence probabilities $P(C_1), P(C_2), \ldots, P(C_n)$, where n is the number of classes. The entropy of the classification system is defined as:
$$H(C) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i)$$
where $P(C_i)$ is the probability that class $C_i$ occurs, which can be estimated as the number of records (documents) in class $C_i$ divided by the total number of records (documents). That is:
$$P(C_i) = \hat{P}(C_i) = \frac{N_{C_i}}{N}$$
where N is the total number of records and $N_{C_i}$ is the number of records in class $C_i$.
Conditional entropy: suppose feature X can take n values $x_1, x_2, \ldots, x_n$. Then, given X, the entropy of the system is defined as:
$$H(C \mid X) = \sum_{i=1}^{n} P(X = x_i) \times H(C \mid X = x_i)$$
where
$$H(C \mid X = x_i) = -\sum_{j=1}^{n} P(C_j \mid X = x_i) \log_2 P(C_j \mid X = x_i)$$
Information gain is defined per feature: for a feature T, it is the difference between the amount of information the system has with the feature and without it, and that difference is exactly the amount of information the feature brings to the system, its gain.
The information gain that feature T brings to the system can be written as the difference between the original system entropy and the conditional entropy after fixing feature T:
$$IG(T) = H(C) - H(C \mid T)$$
In a text classification system, a feature T corresponds to a term, and it takes only two values, "occurs" or "does not occur". We write t for the event that feature T occurs and $\bar{t}$ for the event that it does not.
So:
$$H(C \mid T) = P(t) \times H(C \mid t) + P(\bar{t}) \times H(C \mid \bar{t})$$
where $P(t)$ is the probability that feature T occurs and $P(\bar{t})$ the probability that it does not.
Expanding this formula further:
$$H(C \mid t) = -\sum_{i=1}^{n} P(C_i \mid t) \log_2 P(C_i \mid t)$$
$$H(C \mid \bar{t}) = -\sum_{i=1}^{n} P(C_i \mid \bar{t}) \log_2 P(C_i \mid \bar{t})$$
So IG(T) can be expanded into:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
The feature selection of text extracts the important terms from the whole text collection and involves no notion of class, so the problem must be generalized by treating every text as its own class. The number of classes then equals the number of texts in the collection, N. The parameters of the information gain formula are estimated on this basis.
Notation:
N: the total number of texts, i.e., the total number of classes;
$P(C_i)$: the probability that class $C_i$ occurs, i.e., the probability that text $D_i$ occurs, equal to $\frac{1}{N}$;
$P(t)$: the probability that feature T occurs, estimated as the number of texts containing feature T divided by the total number of texts N, i.e., $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t})$: the probability that feature T does not occur, equal to $1 - P(t)$;
$P(C_i \mid t)$: the probability that a text belongs to class $C_i$ given that it contains feature T. Two estimates are possible:
using the number of texts that contain feature T and belong to class $C_i$ divided by the total number of texts, which takes the value 0 or $\frac{1}{N}$;
or expanding by Bayes' formula, $P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, where $P(t \mid C_i)$ is the probability that feature T occurs in class $C_i$, i.e., in document $D_i$, estimated as $\frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in the document and $TF_i$ is the total number of term occurrences in the document.
$P(C_i \mid \bar{t})$: the probability that a text does not contain feature T and belongs to class $C_i$. Two estimates are possible:
using the number of texts that do not contain feature T and belong to class $C_i$ divided by the total number of texts, which takes the value 0 or $\frac{1}{N}$;
or expanding by Bayes' formula, $P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$, where $P(\bar{t} \mid C_i) = 1 - P(t \mid C_i)$.
It should be noted that:
when estimating P(t), the value may reach 1, which would make $P(\bar{t})$ equal to 0 so that $P(C_i \mid \bar{t})$ cannot be computed; in practice a smoothed estimate of $P(t)$, kept strictly below 1, is therefore used;
$P(t \mid C_i)$ is estimated with $\frac{TF_T}{TF_i}$; if $TF_T$ is 0 this estimate becomes 0, so in practice a smoothed estimate, kept strictly positive, is used.
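Putting the estimates together, the sketch below computes the information gain of every term under the text-as-class treatment ($P(C_i) = 1/N$, $P(t) = DF_T/N$, $P(t \mid C_i) = TF_T/TF_i$). Since the exact smoothing formula is not fixed above, the eps guard here is an illustrative assumption.

```python
import math

def information_gain(word_freq, doc_freq, per_doc, n_docs, eps=1e-9):
    """Score every term by IG under the every-text-is-a-class treatment.
    word_freq, doc_freq, per_doc are as returned by term_statistics();
    eps is an illustrative smoothing guard (the text notes smoothing is
    needed but does not fix the formula)."""
    p_c = 1.0 / n_docs
    h_c = math.log2(n_docs)       # H(C) = -sum (1/N) log2(1/N) = log2(N)
    scores = {}
    for term, df in doc_freq.items():
        p_t = min(df / n_docs, 1.0 - eps)   # keep P(t) strictly below 1
        h_c_t = 0.0                         # accumulates H(C | t)
        h_c_not_t = 0.0                     # accumulates H(C | not t)
        for counts, length in per_doc:
            p_t_ci = max(counts.get(term, 0) / max(length, 1), eps)
            p_ci_t = p_t_ci * p_c / p_t                    # Bayes
            p_ci_not_t = (1.0 - p_t_ci) * p_c / (1.0 - p_t)
            if p_ci_t > 0:
                h_c_t -= p_ci_t * math.log2(p_ci_t)
            if p_ci_not_t > 0:
                h_c_not_t -= p_ci_not_t * math.log2(p_ci_not_t)
        scores[term] = h_c - p_t * h_c_t - (1.0 - p_t) * h_c_not_t
    return scores
```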
The features described in the embodiments of the present invention refer to the terms of the text.
Those skilled in the art can determine the definition of each parameter from the technical solutions of the embodiments of the present invention; the embodiments do not enumerate them all.
Step S115: reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
Embodiment one of the present invention performs feature selection on text based on an information gain algorithm: features are ranked and screened by their importance to the whole system, thereby achieving dimensionality reduction and lightening the computational burden.
Embodiment two
In embodiment two of the present invention, the main flow of the method for dimensionality reduction based on text category feature selection is the same as in embodiment one; the text feature selection flow, shown in Fig. 2, includes the steps:
Step S210: obtain the original text.
Step S211: obtain a segmenter and use it to segment the original text.
Step S212: obtain a noun filter and use it to screen the segmented text for nouns, obtaining a noun set.
Step S213: compute document frequency statistics and store them in redis.
Step S214: compute term frequency statistics and store them in redis.
Step S215: build a forward index of the documents.
Step S216: perform the IG calculation from the statistics of steps S213 and S214.
Step S217: persist the resulting feature words.
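A condensed sketch of this flow (S210 to S217) follows, again assuming pyhanlp and redis-py; the noun screening relies on HanLP part-of-speech tags beginning with "n", and the redis key names are illustrative. term_statistics, information_gain, and build_reference_vector are the sketches from embodiment one.

```python
import redis
from pyhanlp import HanLP

def select_features(texts, top_n):
    """Embodiment two flow: segment, keep nouns, count into redis,
    score by information gain, persist the chosen feature words."""
    r = redis.Redis()
    docs = []
    for i, text in enumerate(texts):
        # S211-S212: segment, then keep nouns (HanLP tags starting "n")
        terms = [t.word for t in HanLP.segment(text)
                 if str(t.nature).startswith("n")]
        docs.append(terms)
        if terms:
            r.rpush(f"forward:{i}", *terms)        # S215: forward index
        for term in set(terms):
            r.hincrby("df", term, 1)               # S213: document frequency
        for term in terms:
            r.hincrby("tf", term, 1)               # S214: term frequency
    word_freq, doc_freq, per_doc = term_statistics(docs)            # S216
    scores = information_gain(word_freq, doc_freq, per_doc, len(docs))
    features = build_reference_vector(scores, top_n)
    if features:
        r.rpush("features", *features)             # S217: persist features
    return features
```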
Embodiment three
Embodiment three of the present invention provides an apparatus for dimensionality reduction based on text category feature selection, comprising an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module.
The acquisition module is configured to obtain the details of the data source text to be processed and store them.
The word segmentation module is configured to segment the data source text with HanLP to obtain multiple terms and remove the stop words among those terms.
The statistics module is configured to compute term frequency (the number of occurrences of each term), term document frequency, and document word count statistics.
The vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector.
The information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection.
The dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
The representation of a text is abstracted as a space vector over a feature word set, but the original candidate feature word set can run to hundreds of thousands of dimensions. On the one hand, such a high-dimensional text representation imposes a computational burden; on the other hand, heavy feature redundancy degrades classification performance. The embodiments of the present invention provide a method and apparatus for feature extraction based on an information gain algorithm that reduce the dimensionality of the feature word set, lighten the corresponding computational burden, eliminate redundant features, and improve classification performance.
It should be noted that the apparatus or system embodiments of the present invention may be implemented in software, in hardware, or in a combination of the two. At the hardware level, besides the CPU, memory, network interface, and non-volatile storage, the device hosting the apparatus in an embodiment can usually also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as a logical apparatus, it is formed by the CPU of its host device reading the corresponding computer program instructions from non-volatile storage into memory and running them.
The foregoing are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A method for dimensionality reduction based on text category feature selection, characterized in that it comprises the steps:
Step A: obtain the details of the data source text to be processed and store them;
Step B: segment the data source text to obtain multiple terms, and remove the stop words among those terms;
Step C: compute term frequency, term document frequency, and document word count statistics;
Step D: store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
Step E: perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
Step F: reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
2. The method for dimensionality reduction based on text category feature selection according to claim 1, characterized in that the information gain calculation in step E comprises the steps:
treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T.
3. The method for dimensionality reduction based on text category feature selection according to claim 2, characterized in that, in step E, $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
4. An apparatus for dimensionality reduction based on text category feature selection, characterized in that it comprises an acquisition module, a word segmentation module, a statistics module, a vector module, an information gain calculation module, and a dimensionality reduction module;
the acquisition module is configured to obtain the details of the data source text to be processed and store them;
the word segmentation module is configured to segment the data source text with HanLP to obtain multiple terms and remove the stop words among those terms;
the statistics module is configured to compute term frequency, term document frequency, and document word count statistics;
the vector module is configured to store the terms, term frequencies, term document frequencies, and document word counts to form a primary text vector;
the information gain calculation module is configured to perform an information gain calculation on the primary text vector to obtain the information gain of each term, sort the terms by the size of their information gain, and form the terms that meet a preset requirement into a reference vector for feature selection;
the dimensionality reduction module is configured to reduce the dimensionality of the text to be processed according to the reference vector, forming the dimension-reduced text vector.
5. The apparatus for dimensionality reduction based on text category feature selection according to claim 4, characterized in that
the information gain calculation module is configured to:
treat every text as a class and the terms in the text as features, and calculate the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where n is the total number of classes, $P(C_i)$ is the probability that class $C_i$ occurs, $P(t)$ is the probability that feature T occurs, $P(\bar{t})$ is the probability that feature T does not occur, and $P(C_i \mid t)$ is the probability that a text belongs to class $C_i$ given that it contains feature T;
$P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T;
$P(\bar{t}) = 1 - P(t)$;
$P(C_i \mid t) = \frac{P(t \mid C_i) \times P(C_i)}{P(t)}$, with $P(t \mid C_i) = \frac{TF_T}{TF_i}$, where $TF_T$ is the number of occurrences of feature T in document $D_i$ and $TF_i$ is the total number of term occurrences in $D_i$;
$P(C_i \mid \bar{t}) = \frac{P(\bar{t} \mid C_i) \times P(C_i)}{P(\bar{t})}$.
CN201610639904.8A 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection Active CN106294689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610639904.8A CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610639904.8A CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection

Publications (2)

Publication Number Publication Date
CN106294689A true CN106294689A (en) 2017-01-04
CN106294689B CN106294689B (en) 2018-09-25

Family

ID=57665827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610639904.8A Active CN106294689B (en) 2016-08-05 2016-08-05 Method and apparatus for dimensionality reduction based on text category feature selection

Country Status (1)

Country Link
CN (1) CN106294689B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033949A (en) * 2010-12-23 2011-04-27 南京财经大学 Correction-based K nearest neighbor text classification method
CN102662952A (en) * 2012-03-02 2012-09-12 成都康赛电子科大信息技术有限责任公司 Chinese text parallel data mining method based on hierarchy
CN105095162A (en) * 2014-05-19 2015-11-25 腾讯科技(深圳)有限公司 Text similarity determining method and device, electronic equipment and system
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308317A (en) * 2018-09-07 2019-02-05 浪潮软件股份有限公司 A kind of hot spot word extracting method of the non-structured text based on cluster
CN110472240A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Text feature and device based on TF-IDF
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics

Also Published As

Publication number Publication date
CN106294689B (en) 2018-09-25

Similar Documents

Publication Publication Date Title
US8484228B2 (en) Extraction and grouping of feature words
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
RU2377645C2 (en) Method and system for classifying display pages using summaries
CN101364239B (en) Method for auto constructing classified catalogue and relevant system
CN106250526A (en) A kind of text class based on content and user behavior recommends method and apparatus
CN104573054A (en) Information pushing method and equipment
CN110263248A (en) A kind of information-pushing method, device, storage medium and server
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
US20140229486A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN105975459A (en) Lexical item weight labeling method and device
CN103838798A (en) Page classification system and method
CN106096609A (en) A kind of merchandise query keyword automatic generation method based on OCR
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
Shi et al. Mining chinese reviews
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
CN106294689A (en) A kind of method and apparatus selecting based on text category feature to carry out dimensionality reduction
CN111444713B (en) Method and device for extracting entity relationship in news event
CN104881447A (en) Searching method and device
Lin Association rule mining for collaborative recommender systems.
CN109992665A (en) A kind of classification method based on the extension of problem target signature
Campbell et al. Content+ context networks for user classification in twitter
CN107291686B (en) Method and system for identifying emotion identification
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
Bollegala et al. Extracting key phrases to disambiguate personal name queries in web search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant