CN107665248A - Text classification method and device based on deep learning hybrid model - Google Patents

Text classification method and device based on deep learning hybrid model

Info

Publication number
CN107665248A
CN107665248A
Authority
CN
China
Prior art keywords
deep learning
mixed model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710864498.XA
Other languages
Chinese (zh)
Inventor
杨振宇
庞雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN201710864498.XA
Publication of CN107665248A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text classification method and device based on a deep learning hybrid model. The method includes: obtaining text data and preprocessing the text data; performing feature learning on the text data with a deep learning hybrid model that combines a denoising autoencoder with a deep belief network; and classifying the features obtained by the learning with a Softmax regression model. The deep learning hybrid model of the present invention has strong adaptability, and the classification method can meet the classification needs of most kinds of text.

Description

Text classification method and device based on deep learning hybrid model
Technical field
The invention belongs to the field of deep network models, and in particular relates to a text classification method and device based on a deep learning hybrid model.
Background art
With the continuous development of information technology, the amount of electronic text is growing rapidly, signaling the arrival of the big data era. In this context, organizing and exploiting such massive amounts of text effectively has become especially important. Text classification, as a technical foundation of fields such as information retrieval, digital libraries, and information filtering, has broad application prospects.
Text representation has always been a key problem in the field of natural language processing. Traditional text representations suffer from the curse of dimensionality and data sparsity, which has become a bottleneck limiting the performance of many natural language processing tasks. Deep learning models have deep structures of multi-layer nonlinear mappings, and training such models as multi-layer neural networks can effectively reduce the dimensionality of the data. At the same time, deep learning can approximate complex functions with fewer parameters and learn good features from data. Most importantly, a deep neural network can be unrolled into a BP (back-propagation) neural network after training, so BP error back-propagation can be used to optimize the performance of the whole network. However, the many possible network types, layer counts, and combinations of networks mean that neural-network-based classification models currently come in many forms, and the prior art does not clearly state which model, or which hybrid of models, possesses better classification performance or stronger adaptability.
Therefore, finding a classification method with better performance and stronger generality remains a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a text classification method based on a deep learning hybrid model. Feature learning on text is performed by combining the denoising autoencoder and the deep belief network from the family of deep learning models, and classification is finally performed with a Softmax regression model. The deep learning hybrid model has strong adaptability, and the classification method can meet the classification needs of most kinds of text.
To achieve the above object, the present invention adopts the following technical scheme:
A text classification method based on a deep learning hybrid model comprises the following steps:
Step 1: obtain text data and preprocess the text data;
Step 2: perform feature learning on the text data with a deep learning hybrid model combining a denoising autoencoder and a deep belief network;
Step 3: classify the features obtained by the learning with a Softmax regression model.
Further, the preprocessing includes:
(1) preliminarily filtering the text data to be classified;
(2) segmenting the text data into words, and further filtering the text data on the basis of the segmentation;
(3) representing the text features with a VSM (vector space model).
Further, the preliminary filtering includes removing forms and punctuation marks from the text; the further filtering includes removing stop words and filtering by part of speech, retaining only verbs and nouns.
Further, the deep learning hybrid model combining the denoising autoencoder and the deep belief network is formed by cascading a denoising autoencoder with a deep belief network, wherein the output of the denoising autoencoder serves as the input of the deep belief network.
Further, the denoising autoencoder is arranged as two layers: the first layer maps the input data to a higher-dimensional space and its output serves as the input of the second layer, which compresses the data; the resulting data serve as the input of the deep belief network, which has five layers.
Further, the denoising autoencoder is trained as follows:
First, the input vector x is corrupted to obtain x̂, and the first-layer encoder reduces the dimensionality of the high-dimensional data; through the activation function and a linear transformation, the hidden coding result y is obtained:

y = h(x̂) = s_f(W·x̂ + a_y)

where s_f is the nonlinear activation function, with expression s_f(t) = 1/(1 + e^(-t));
Then, the second-layer decoder f(y) maps the hidden-layer data y to the reconstruction z:

z = f(y) = s_g(W'·y + a_z)

where s_g is the activation function of the decoder (a sigmoid function is used), W' = W^T is the transpose of W, and a_y and a_z are bias vectors;
The iteration is performed to find, on the training sample set, the parameters θ = {W, a_y, a_z} that minimize the reconstruction error, and W, a_y and a_z are updated according to the following formulas:

W ← W - λ·∂L/∂W,  a_y ← a_y - λ·∂L/∂a_y,  a_z ← a_z - λ·∂L/∂a_z

where λ is the learning rate.
Further, the reconstruction error is expressed as:

θ* = argmin_θ (1/N) Σ_{i=1}^{N} L(X_i, Z_i), with the cross-entropy loss L(x, z) = -Σ_k [x_k·log z_k + (1 - x_k)·log(1 - z_k)]

where N is the number of training samples, X_i is the i-th input, and Z_i is the data after the i-th decoding and reconstruction.
Further, the deep belief network is trained as follows:
Pre-training is first performed according to the layer-wise greedy method, and the network is then fine-tuned using BP error back-propagation; during pre-training, the weights are computed as follows:
For each record X in the training set, X is assigned to the visible layer v^(0), and the probability that it activates the hidden neurons is computed:

P(h_j^(0) = 1 | v^(0)) = σ(b_j + Σ_i W_{ji}·v_i^(0))

where superscripts distinguish different vectors and subscripts distinguish different dimensions of the same vector. Then a sample h^(0) ~ P(h^(0) | v^(0)) is drawn from the computed distribution, the visible layer is reconstructed with h^(0) via P(v_i^(1) = 1 | h^(0)) = σ(a_i + Σ_j W_{ji}·h_j^(0)), and a sample v^(1) ~ P(v^(1) | h^(0)) of the visible layer is drawn; the probability that the hidden neurons are activated is then computed from the reconstructed visible-layer neurons, P(h_j^(1) = 1 | v^(1)) = σ(b_j + Σ_i W_{ji}·v_i^(1)), and the weights are updated according to the formula:

W ← W + λ·(P(h^(0) = 1 | v^(0))·v^(0)T - P(h^(1) = 1 | v^(1))·v^(1)T)
According to the second object of the invention, the present invention further provides a text classification device based on a deep learning hybrid model, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the text classification based on a deep learning hybrid model according to any one of claims 1-8.
According to the third object of the invention, the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, performs the text classification based on a deep learning hybrid model according to any one of claims 1-8.
Beneficial effects of the present invention
1. In the new feature-learning hybrid model for text classification proposed by the present invention, a denoising autoencoder and a deep belief network are cascaded, with the output of the denoising autoencoder serving as the input of the deep belief network. Experimental results show that this deep learning hybrid model has strong adaptability and can meet the classification requirements of most kinds of text; moreover, the model is simple and improves text classification performance.
2. The text classification method provided by the invention can be applied to the mining of any text as required; it is practical and easy to popularize, with applications such as web page classification, microblog sentiment analysis, user-comment mining and analysis, information retrieval, digital libraries, and information filtering.
Brief description of the drawings
The accompanying drawings, which form a part of the application, are provided for further understanding of the application; the exemplary embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of it.
Fig. 1 is the flow chart of the text classification method based on the deep learning hybrid model of the present invention;
Fig. 2 shows the influence of different corruption rates in the denoising autoencoder on classification;
Fig. 3 shows the influence of the number of hidden-layer nodes in the deep belief network on the classification results;
Fig. 4 shows the influence of different classification algorithms on classification accuracy.
Detailed description of the embodiments
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the application belongs.
It should be noted that the terms used herein are merely for describing embodiments and are not intended to limit the exemplary embodiments according to the application. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
In the case of no conflict, the features of the embodiments in the application may be combined with each other.
Embodiment one
This embodiment discloses a text classification method based on a deep learning hybrid model which, as shown in Fig. 1, comprises the following steps:
Step 1: obtain the text data to be classified and preprocess the text data;
The preprocessing specifically includes:
(1) preliminarily filtering the text data to be classified;
Specifically, useless information in the text documents, such as forms and punctuation marks, is removed.
(2) segmenting the text data into words, and further filtering the text data on the basis of the segmentation;
Here the text is segmented with the NLPIR Chinese word segmentation system, and stop words in the documents are removed after segmentation. Part-of-speech tagging is carried out with the ICTCLAS system of the Chinese Academy of Sciences; features with no predictive power, such as auxiliary words and prepositions, are removed, and only verbs and nouns are extracted as feature words.
(3) representing the text features with a VSM model; a sketch of this whole pipeline is given below.
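A minimal Python sketch of the preprocessing pipeline. The patent names NLPIR and ICTCLAS for segmentation and part-of-speech tagging; the sketch substitutes the freely available jieba tagger and builds the VSM representation with a TF-IDF vectorizer, so the tool choice, the tiny stop-word list, and the helper names are illustrative assumptions rather than the patent's exact implementation.

import re
import jieba.posseg as pseg
from sklearn.feature_extraction.text import TfidfVectorizer

STOPWORDS = {"的", "了", "和", "是"}  # illustrative; a real stop-word list is far larger

def preprocess(doc: str) -> str:
    # (1) Preliminary filtering: strip punctuation and form/table residue
    # (\w matches CJK characters in Python 3, so Chinese text is kept).
    doc = re.sub(r"[^\w]+", " ", doc)
    # (2) Segment, drop stop words, keep only nouns (n*) and verbs (v*).
    kept = [w for w, flag in pseg.cut(doc)
            if w not in STOPWORDS and flag[:1] in ("n", "v")]
    return " ".join(kept)

def build_vsm(corpus):
    # (3) VSM feature representation: TF-IDF over the filtered tokens.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    features = vectorizer.fit_transform(preprocess(d) for d in corpus)
    return features, vectorizer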
Step 2: perform feature learning on the text data with the deep learning hybrid model combining the denoising autoencoder and the deep belief network;
Specifically, the feature representation is input into a deep learning hybrid model (DABN) combining a denoising autoencoder (DAE) with a deep belief network (DBN) for feature learning.
The feature learning stage proceeds as follows:
First, another representation of the initial features is learned by a two-layer DAE. The first-layer denoising autoencoder maps the input data to a higher-dimensional space, giving it stronger separability; the input of the second-layer denoising autoencoder is the output of the first layer, and it compresses the data. Deeper feature extraction is then performed on the text with a 5-layer DBN, taking the output data of the DAE as the input data of the DBN.
The denoising autoencoder is trained as follows:
The encoder performs the dimensionality-reduction operation on the high-dimensional data: the input vector x is first corrupted to obtain x̂, which is then fed into the encoder; through the activation function and a linear transformation, the hidden coding result y is finally obtained:

y = h(x̂) = s_f(W·x̂ + a_y)

The decoder f(y), which maps the hidden-layer data back to the reconstruction z, is expressed as the function:

z = f(y) = s_g(W'·y + a_z)

where s_f is the nonlinear activation function, with expression s_f(t) = 1/(1 + e^(-t)); s_g is the activation function of the decoder, for which a sigmoid function is used here; W' = W^T is the transpose of W; and a_y and a_z are bias vectors.
Training the DAE means finding, on the training sample set, the parameters θ = {W, a_y, a_z} that minimize the reconstruction error:

θ* = argmin_θ (1/N) Σ_{i=1}^{N} L(X_i, Z_i)

where L is the reconstruction error function. The cross-entropy loss is used here, with expression:

L(x, z) = -Σ_k [x_k·log z_k + (1 - x_k)·log(1 - z_k)]

where N is the number of training samples, X_i is the i-th input, and Z_i is the data after the i-th decoding and reconstruction. In each iteration the DAE updates the weight matrix as W ← W - λ·∂L/∂W, where λ is the learning rate; a_y is updated as a_y ← a_y - λ·∂L/∂a_y, and a_z as a_z ← a_z - λ·∂L/∂a_z.
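As a concrete illustration, the following NumPy sketch implements one training step of such a denoising-autoencoder layer under the formulas above: sigmoid activations on both sides, tied weights W' = W^T, masking corruption, and plain per-sample gradient descent. The class name, initialization scale, and default hyperparameters are illustrative assumptions.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

class DAELayer:
    """One denoising-autoencoder layer with tied weights (W' = W^T)."""
    def __init__(self, n_in, n_hidden, corruption=0.16, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.01, size=(n_hidden, n_in))
        self.a_y = np.zeros(n_hidden)   # encoder bias a_y
        self.a_z = np.zeros(n_in)       # decoder bias a_z
        self.corruption, self.lr = corruption, lr

    def train_step(self, x):
        # Inputs are assumed scaled to [0, 1] (e.g. normalized TF-IDF),
        # as required by the cross-entropy reconstruction loss.
        # Corrupt the input: zero a random fraction of entries to get x_hat.
        x_hat = x * (self.rng.random(x.shape) >= self.corruption)
        y = sigmoid(self.W @ x_hat + self.a_y)      # y = s_f(W x_hat + a_y)
        z = sigmoid(self.W.T @ y + self.a_z)        # z = s_g(W' y + a_z)
        # Gradients of the cross-entropy loss for sigmoid units.
        dz = z - x                                  # dL/d(decoder pre-activation)
        dy = (self.W @ dz) * y * (1.0 - y)          # dL/d(encoder pre-activation)
        dW = np.outer(dy, x_hat) + np.outer(y, dz)  # both uses of the tied W
        self.W -= self.lr * dW
        self.a_y -= self.lr * dy
        self.a_z -= self.lr * dz
        eps = 1e-7                                  # numeric guard for the logs
        return -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))

Stacking two such layers, the first widening the representation and the second compressing it, reproduces the two-layer DAE described above.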
The deep belief network is trained as follows:
DBN training includes two processes, pre-training and fine-tuning. Pre-training is first carried out according to the layer-wise greedy method:
1. First, the first RBM is fully trained;
2. The offsets and weights of the first RBM are fixed, and its hidden layer is used as the input of the second RBM;
3. After the second RBM has been fully trained, it is stacked on top of the first RBM;
4. Steps 1-3 are repeated as many times as needed;
5. If the training-set data are labelled, then when the top-level RBM is trained, the neurons representing the classification labels are trained together with the visible-layer neurons.
The weights are computed as follows:
For each record X in the training set, X is assigned to the visible layer v^(0), and the probability that it activates the hidden neurons is computed:

P(h_j^(0) = 1 | v^(0)) = σ(b_j + Σ_i W_{ji}·v_i^(0))

where superscripts distinguish different vectors and subscripts distinguish different dimensions of the same vector. Then a sample h^(0) ~ P(h^(0) | v^(0)) is drawn from the computed distribution, the visible layer is reconstructed with h^(0) via P(v_i^(1) = 1 | h^(0)) = σ(a_i + Σ_j W_{ji}·h_j^(0)), and a sample v^(1) ~ P(v^(1) | h^(0)) of the visible layer is drawn; the probability that the hidden neurons are activated is then computed from the reconstructed visible-layer neurons, P(h_j^(1) = 1 | v^(1)) = σ(b_j + Σ_i W_{ji}·v_i^(1)), and the weights are updated according to the formula:

W ← W + λ·(P(h^(0) = 1 | v^(0))·v^(0)T - P(h^(1) = 1 | v^(1))·v^(1)T)
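The update above is one step of contrastive divergence (CD-1). The following NumPy sketch implements it for a single RBM layer under the stated formulas, assuming binary units and one training record at a time; momentum, weight decay, and mini-batching are omitted, and the function name is an illustrative choice.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cd1_step(W, a, b, v0, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 update. W: (n_hidden, n_visible) weights, a: visible bias,
    b: hidden bias, v0: one binary training record."""
    # P(h(0) = 1 | v(0)) and a sample h(0) drawn from it.
    p_h0 = sigmoid(b + W @ v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Reconstruct the visible layer and draw a sample v(1).
    p_v1 = sigmoid(a + W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # P(h(1) = 1 | v(1)) from the reconstructed visible layer.
    p_h1 = sigmoid(b + W @ v1)
    # W <- W + lr * (P(h(0)=1|v(0)) v(0)^T - P(h(1)=1|v(1)) v(1)^T),
    # with the customary bias updates alongside.
    W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)
    return W, a, b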
Because the DBN is trained according to the layer-wise greedy method, the error of an earlier RBM is passed on gradually to the RBMs of later layers without being corrected. The network can therefore be fine-tuned with BP error back-propagation: the weights of the DBN are used to initialize the weights of the BP neural network, which makes the classification effect better.
Step 3: classify the features obtained by the learning with a Softmax regression model.
The Softmax regression model is the extension of the logistic regression model to multi-class problems (logistic regression solves binary classification). The text features output by the deep learning hybrid model are used as the input of the Softmax regression model to classify the text.
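For illustration, a minimal Softmax-regression sketch over the learned features follows. The patent does not specify the training procedure for this stage, so plain batch gradient descent on the cross-entropy objective is an assumption, as are the function names; a bias term is omitted for brevity.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.1, epochs=200):
    """X: (n_samples, n_features) learned features; y: integer class labels."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                 # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W)                   # class probabilities
        W -= lr * X.T @ (P - Y) / n          # cross-entropy gradient step
    return W

def predict(W, X):
    return np.argmax(X @ W, axis=1)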
Optionally, the method further includes step 4: evaluating the classification performance. Specifically, the main performance evaluation indicators of text classification are recall, precision, and a combined accuracy measure.
Embodiment two
The purpose of this embodiment is to provide a computing device.
A text classification device based on a deep learning hybrid model includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the following steps are implemented:
Step 1: obtain text data and preprocess the text data;
Step 2: perform feature learning on the text data with a deep learning hybrid model combining a denoising autoencoder and a deep belief network;
Step 3: classify the features obtained by the learning with a Softmax regression model.
Embodiment three
The purpose of this embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium on which a computer program for text classification based on a deep learning hybrid model is stored; when executed by a processor, the program performs the following steps:
Step 1: obtain text data and preprocess the text data;
Step 2: perform feature learning on the text data with a deep learning hybrid model combining a denoising autoencoder and a deep belief network;
Step 3: classify the features obtained by the learning with a Softmax regression model.
The steps involved in embodiments two and three above correspond to method embodiment one; for details, see the related description of embodiment one. The term "computer-readable storage medium" should be understood as a single medium or multiple media that include one or more instruction sets; it should also be understood as including any medium that can store, encode or carry an instruction set to be executed by a processor and that causes the processor to perform any of the methods of the present invention.
Experimental result
The experimental data set chosen here comes from the Fudan University corpus (Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University), provided by Li Ronglu of Fudan University. Ten categories, 4000 documents in total, were selected for the experiments, with 400 samples per category; the training corpus and the test corpus are split in a 1:1 ratio.
1. The DAE uses two layers of 2000 and 1000 nodes respectively and 2000 iterations, and classification experiments are carried out with corruption rates of 0.1, 0.14, 0.18, 0.22, 0.26 and 0.3. The representation matrix of each layer is first subjected to the randomized zeroing operation and then trained; the next layer is trained only after the first layer's training has finished.
2. The DBN neural network has 5 layers of 600-500-300-100-15 nodes respectively, and the number of BP fine-tuning iterations is 300.
3. SVM classification experiments are carried out with the libsvm toolbox, and KNN classification experiments with the knnclassify classifier provided by MATLAB.
The main performance evaluation indicators of text classification are recall, precision, and a combined accuracy measure. Suppose that in the classification results for category a_i, the number of samples correctly assigned to the category is b, the number of samples of this category wrongly assigned to other categories is c, and the number of samples of other categories wrongly assigned to this category is d, with C categories in total.
Recall: Recall = b/(b+c), which measures how completely the samples of a category are retrieved.
Precision: Precision = b/(b+d), which measures how accurately samples are assigned to a category.
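In code, these per-class measures can be read directly off a confusion matrix; the sketch below assumes rows index the true class and columns the predicted class, a layout chosen here for illustration.

import numpy as np

def per_class_metrics(M):
    """M[i, j] counts samples of true class i predicted as class j."""
    b = np.diag(M).astype(float)   # correctly assigned to each class
    c = M.sum(axis=1) - b          # this class wrongly sent elsewhere
    d = M.sum(axis=0) - b          # other classes wrongly pulled in
    recall = b / (b + c)
    precision = b / (b + d)
    return recall, precision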
Table 1 records the comparison of the accuracy and recall of the different classification algorithms:
Fig. 2 shows the influence of different corruption rates in the denoising autoencoder on classification. As can be seen from the figure, once a corruption rate is introduced, the influence of the DAE on classification follows a roughly parabolic curve: the classification accuracy is lowest at corruption rates of 0.08 and 0.24, and highest at a corruption rate of 0.16.
Fig. 3 shows the influence of the number of first-layer hidden nodes in the deep belief network on classification accuracy. As the number of DBN hidden-layer nodes increases, the recall and accuracy of text classification rise at first but decline once the number of hidden units exceeds 600, mainly because too many or too few hidden units are both unfavorable for expressing the data features. When the number of hidden-layer nodes is 600, the accuracy of text classification reaches a maximum of 91% and the recall a maximum of 89%.
Fig. 4 shows the influence of different classification algorithms on classification accuracy. From Table 1 and Fig. 4 it can be seen that the classification effect of the deep learning hybrid model proposed here is better than that of the traditional classification algorithms.
The above embodiments are only specific cases of the present invention, and the patent protection scope of the invention includes but is not limited to the above embodiments. Any text classification method that conforms to the claims of the present invention, and any appropriate change or replacement made to it by one of ordinary skill in any technical field, shall fall within the patent protection scope of the present invention.
Those skilled in the art will understand that the modules or steps of the invention described above can be implemented with a general-purpose computing device; alternatively, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can each be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the above specific embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the protection scope of the present invention. Those skilled in the art should understand that, on the basis of the technical scheme of the present invention, various modifications or variations that can be made without creative effort still fall within the protection scope of the present invention.

Claims (10)

1. A text classification method based on a deep learning hybrid model, characterized by comprising the following steps:
Step 1: obtaining text data and preprocessing the text data;
Step 2: performing feature learning on the text data with a deep learning hybrid model combining a denoising autoencoder and a deep belief network;
Step 3: classifying the features obtained by the learning with a Softmax regression model.
2. The text classification method based on a deep learning hybrid model according to claim 1, characterized in that the preprocessing includes:
(1) preliminarily filtering the text data to be classified;
(2) segmenting the text data into words, and further filtering the text data on the basis of the segmentation;
(3) representing the text features with a VSM model.
3. The text classification method based on a deep learning hybrid model according to claim 2, characterized in that the preliminary filtering includes removing forms and punctuation marks from the text; the further filtering includes removing stop words and filtering by part of speech, retaining only verbs and nouns.
4. The text classification method based on a deep learning hybrid model according to claim 1, characterized in that the deep learning hybrid model combining the denoising autoencoder and the deep belief network is formed by cascading a denoising autoencoder with a deep belief network, wherein the output of the denoising autoencoder serves as the input of the deep belief network.
5. The text classification method based on a deep learning hybrid model according to claim 4, characterized in that the denoising autoencoder is arranged as two layers: the first layer maps the input data to a higher-dimensional space and its output serves as the input of the second layer, which compresses the data; the resulting data serve as the input of the deep belief network, and the deep belief network has five layers.
6. The text classification method based on a deep learning hybrid model according to claim 4, characterized in that the denoising autoencoder is trained as follows:
First, the input vector x is corrupted to obtain x̂, and the first-layer encoder reduces the dimensionality of the high-dimensional data; through the activation function and a linear transformation, the hidden coding result y is obtained:

y = h(x̂) = s_f(W·x̂ + a_y)

where s_f is the nonlinear activation function, with expression s_f(t) = 1/(1 + e^(-t));
Then, the second-layer decoder f(y) maps the hidden-layer data y to the reconstruction z:

z = f(y) = s_g(W'·y + a_z)

where s_g is the activation function of the decoder (a sigmoid function is used), W' = W^T is the transpose of W, and a_y and a_z are bias vectors;
The iteration is performed to find, on the training sample set, the parameters θ = {W, a_y, a_z} that minimize the reconstruction error, and W, a_y and a_z are updated according to the following formulas:

W ← W - λ·∂L/∂W,  a_y ← a_y - λ·∂L/∂a_y,  a_z ← a_z - λ·∂L/∂a_z

where λ is the learning rate.
7. The text classification method based on a deep learning hybrid model according to claim 6, characterized in that the reconstruction error is expressed as:

θ* = argmin_θ (1/N) Σ_{i=1}^{N} L(X_i, Z_i), with the cross-entropy loss L(x, z) = -Σ_k [x_k·log z_k + (1 - x_k)·log(1 - z_k)]

where N is the number of training samples, X_i is the i-th input, and Z_i is the data after the i-th decoding and reconstruction.
8. The text classification method based on a deep learning hybrid model according to claim 1, characterized in that the deep belief network is trained as follows:
Pre-training is first performed according to the layer-wise greedy method, and the network is then fine-tuned using BP error back-propagation; during pre-training, the weights are computed as follows:
For each record X in the training set, X is assigned to the visible layer v^(0), and the probability that it activates the hidden neurons is computed:

P(h_j^(0) = 1 | v^(0)) = σ(b_j + Σ_i W_{ji}·v_i^(0))

where superscripts distinguish different vectors and subscripts distinguish different dimensions of the same vector; then a sample h^(0) ~ P(h^(0) | v^(0)) is drawn from the computed distribution, the visible layer is reconstructed with h^(0) via P(v_i^(1) = 1 | h^(0)) = σ(a_i + Σ_j W_{ji}·h_j^(0)), and a sample v^(1) ~ P(v^(1) | h^(0)) of the visible layer is drawn; the probability that the hidden neurons are activated is then computed from the reconstructed visible-layer neurons, P(h_j^(1) = 1 | v^(1)) = σ(b_j + Σ_i W_{ji}·v_i^(1)), and the weights are updated according to the formula:

W ← W + λ·(P(h^(0) = 1 | v^(0))·v^(0)T - P(h^(1) = 1 | v^(1))·v^(1)T).
9. A text classification device based on a deep learning hybrid model, including a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the text classification based on a deep learning hybrid model according to any one of claims 1-8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, performs the text classification based on a deep learning hybrid model according to any one of claims 1-8.
CN201710864498.XA 2017-09-22 2017-09-22 Text classification method and device based on deep learning hybrid model Pending CN107665248A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710864498.XA CN107665248A (en) 2017-09-22 2017-09-22 Text classification method and device based on deep learning hybrid model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710864498.XA CN107665248A (en) 2017-09-22 2017-09-22 Text classification method and device based on deep learning hybrid model

Publications (1)

Publication Number Publication Date
CN107665248A true CN107665248A (en) 2018-02-06

Family

ID=61097104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710864498.XA Pending CN107665248A (en) 2017-09-22 2017-09-22 Text classification method and device based on deep learning hybrid model

Country Status (1)

Country Link
CN (1) CN107665248A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302884A (en) * 2015-10-19 2016-02-03 天津海量信息技术有限公司 Deep learning-based webpage mode recognition method and visual structure learning method
CN106295245A (en) * 2016-07-27 2017-01-04 广州麦仑信息科技有限公司 The method of storehouse noise reduction own coding gene information feature extraction based on Caffe

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Chao: "Research on text classification based on a deep learning hybrid model", Wanfang Database *
Hu Zhen et al.: "The composer classification problem based on deep learning", Journal of Computer Research and Development *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447565B (en) * 2018-03-23 2021-10-08 北京工业大学 Small gestational age infant prediction method based on improved noise reduction automatic encoder
CN108447565A (en) * 2018-03-23 2018-08-24 北京工业大学 A kind of small for gestational age infant disease forecasting method based on improvement noise reduction autocoder
CN108595717A (en) * 2018-05-18 2018-09-28 北京慧闻科技发展有限公司 For the data processing method of text classification, data processing equipment and electronic equipment
CN108763384A (en) * 2018-05-18 2018-11-06 北京慧闻科技发展有限公司 For the data processing method of text classification, data processing equipment and electronic equipment
CN109299246A (en) * 2018-12-04 2019-02-01 北京容联易通信息技术有限公司 A kind of file classification method and device
CN109829054A (en) * 2019-01-17 2019-05-31 齐鲁工业大学 A kind of file classification method and system
CN110750640A (en) * 2019-09-17 2020-02-04 平安科技(深圳)有限公司 Text data classification method and device based on neural network model and storage medium
CN110750640B (en) * 2019-09-17 2022-11-04 平安科技(深圳)有限公司 Text data classification method and device based on neural network model and storage medium
CN111309909A (en) * 2020-02-13 2020-06-19 北京工业大学 Text emotion classification method based on hybrid model
CN111309909B (en) * 2020-02-13 2021-07-30 北京工业大学 Text emotion classification method based on hybrid model
CN111274406A (en) * 2020-03-02 2020-06-12 湘潭大学 Text classification method based on deep learning hybrid model
CN112735604A (en) * 2021-01-13 2021-04-30 大连海事大学 Novel coronavirus classification method based on deep learning algorithm
CN112735604B (en) * 2021-01-13 2024-03-26 大连海事大学 Novel coronavirus classification method based on deep learning algorithm
CN113239190A (en) * 2021-04-27 2021-08-10 天九共享网络科技集团有限公司 Document classification method and device, storage medium and electronic equipment
CN113239190B (en) * 2021-04-27 2024-02-20 天九共享网络科技集团有限公司 Document classification method, device, storage medium and electronic equipment
CN113449491A (en) * 2021-07-05 2021-09-28 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder
CN113449491B (en) * 2021-07-05 2023-12-26 思必驰科技股份有限公司 Pre-training framework for language understanding and generation with two-stage decoder

Similar Documents

Publication Publication Date Title
CN107665248A (en) Text classification method and device based on deep learning hybrid model
Onan Sentiment analysis on product reviews based on weighted word embeddings and deep neural networks
Li et al. DeepPatent: patent classification with convolutional neural networks and word embedding
Tsaptsinos Lyrics-based music genre classification using a hierarchical attention network
CN107229610B (en) A kind of analysis method and device of affection data
CN104834747B (en) Short text classification method based on convolutional neural networks
CN110750640B (en) Text data classification method and device based on neural network model and storage medium
CN109558487A (en) Document Classification Method based on the more attention networks of hierarchy
Terechshenko et al. A comparison of methods in political science text classification: Transfer learning language models for politics
CN109189925A (en) Term vector model based on mutual information and based on the file classification method of CNN
CN107832400A (en) A kind of method that location-based LSTM and CNN conjunctive models carry out relation classification
CN106294684A (en) The file classification method of term vector and terminal unit
Zhao et al. The study on the text classification for financial news based on partial information
CN105976056A (en) Information extraction system based on bidirectional RNN
CN108509982A (en) A method of the uneven medical data of two classification of processing
CN108090231A (en) A kind of topic model optimization method based on comentropy
CN105740236A (en) Writing feature and sequence feature combined Chinese sentiment new word recognition method and system
CN106959946A (en) A kind of text semantic feature generation optimization method based on deep learning
CN107688870A (en) A kind of the classification factor visual analysis method and device of the deep neural network based on text flow input
CN108920446A (en) A kind of processing method of Engineering document
CN115186069A (en) CNN-BiGRU-based academic text abstract automatic classification method
CN113204640A (en) Text classification method based on attention mechanism
Zhang et al. Structure learning for headline generation
CN116306785A (en) Student performance prediction method of convolution long-short term network based on attention mechanism
Bai et al. Gated character-aware convolutional neural network for effective automated essay scoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180206