CN105512104A - Dictionary dimension reducing method and device and information classifying method and device


Info

Publication number
CN105512104A
Authority
CN
China
Prior art date
Legal status
Pending
Application number
CN201510874528.6A
Other languages
Chinese (zh)
Inventor
张昊
朱频频
Current Assignee
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Original Assignee
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority date
2015-12-02
Filing date
2015-12-02
Publication date
2016-04-20
Application filed by Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority to CN201510874528.6A
Publication of CN105512104A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/30 Semantic analysis


Abstract

The invention discloses a dictionary dimension reduction method and device, and an information classification method and device. The dictionary dimension reduction method comprises: preprocessing the corpus obtained from a question-and-answer log to obtain text data; performing word segmentation on the text data to obtain multiple corpus words; filtering the corpus words to obtain a dictionary containing multiple keywords; and counting, according to the question-and-answer log, the information classes involved in the corpus, calculating the information entropy of each keyword in the dictionary, and deleting from the dictionary the keywords whose information entropy is smaller than an information entropy threshold, wherein the information entropy reflects the probability that the keyword appears in each information class. With the method and device, words that are useless for classification can be filtered out quickly so as to reduce the dimension of the dictionary, and the reduced dictionary still yields a good classification accuracy.

Description

Dictionary dimension reduction method and device, and information classification method and device
Technical field
The present invention relates to the technical field of information processing, and in particular to a dictionary dimension reduction method and device, and an information classification method and device.
Background art
At present, in natural language processing, text often needs to be assigned first to a corresponding processing module in order to improve execution efficiency, for example in text description content classification, text sentiment classification, advertisement classification and spam filtering systems. These classification processes require building a dictionary for vectorizing the text content. Since not every word that appears has an effect on classification, the generated dictionary should be kept as small as possible, thereby effectively reducing the computational complexity.
In the prior art, dimension reduction methods based on SVD, LDA and PCA all achieve the dimension reduction effect through matrix decomposition. Their accuracy is high, but because decomposing large matrices is inefficient, these methods require a considerable amount of time, and it is also difficult to reach an optimal result through repeated tuning.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a dictionary dimension reduction method and device, and an information classification method and device, that overcome or at least partially solve the above problems.
The invention provides a dictionary dimension reduction method, comprising:
preprocessing the corpus obtained from a question-and-answer log to obtain text data;
performing word segmentation on the text data to obtain multiple corpus words;
filtering the corpus words to obtain a dictionary containing multiple keywords; and
counting, according to the question-and-answer log, the information classes involved in the corpus, calculating the information entropy of each keyword in the dictionary, and deleting from the dictionary the keywords whose information entropy is smaller than an information entropy threshold, wherein the information entropy reflects the probability that the keyword appears in each information class.
The invention provides an information classification method, comprising the above dictionary dimension reduction method.
The invention further provides a dictionary dimension reduction device, comprising:
a preprocessing module, configured to preprocess the corpus obtained from the question-and-answer log to obtain text data;
a word segmentation module, configured to perform word segmentation on the text data to obtain multiple corpus words;
a filtering module, configured to filter the corpus words to obtain a dictionary containing multiple keywords; and
a calculation module, configured to count, according to the question-and-answer log, the information classes involved in the corpus, calculate the information entropy of each keyword in the dictionary, and delete from the dictionary the keywords whose information entropy is smaller than the information entropy threshold, wherein the information entropy reflects the probability that the keyword appears in each information class.
The invention provides an information classification device, comprising the above dictionary dimension reduction device.
The beneficial effects of the present invention are as follows:
By using the information entropy of each word across the different classes to quickly filter candidate words, the built dictionary is dimension-reduced. This solves the problems that prior-art dictionary dimension reduction methods require a considerable amount of time and cannot reach an optimal result through repeated tuning; words that are useless for classification can be filtered out of the dictionary quickly, and the reduced dictionary still yields a good classification accuracy.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art by reading the following detailed description of the preferred embodiments. The accompanying drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of the dictionary dimension reduction method of the embodiment of the present invention;
Fig. 2 is a flowchart of the detailed process of the dictionary dimension reduction method of the embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the dictionary dimension reduction device of the embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
In order to solve the problems that prior-art dictionary dimension reduction methods require a considerable amount of time and cannot reach an optimal result through repeated tuning, the present invention provides a dictionary dimension reduction method and device and an information classification method and device, which are further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and do not limit it.
Method embodiment one
According to an embodiment of the invention, a dictionary dimension reduction method is provided. Fig. 1 is a flowchart of the dictionary dimension reduction method of the embodiment of the present invention; as shown in Fig. 1, the dictionary dimension reduction method according to the embodiment of the present invention comprises the following process:
Step 101: preprocess the corpus obtained from the question-and-answer log to obtain text data. In step 101, the preprocessing includes: unifying the format of the corpus data into a text format, filtering one or more of dirty words, sensitive words and stop words, and splitting the filtered text data into multiple lines according to punctuation. For example, the punctuation may be a question mark, exclamation mark, semicolon or full stop; that is, the filtered text data may be split into multiple lines at question marks, exclamation marks, semicolons or full stops.
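By way of illustration only, a minimal Python sketch of the preprocessing in step 101 is given below; the helper name `preprocess`, the placeholder stop-word and sensitive-word sets, and the character-level word removal are assumptions made for the example and are not prescribed by the patent.

```python
import re

# Assumed placeholder word lists; a real system would load curated lists.
STOP_WORDS = {"的", "了", "吧"}
SENSITIVE_WORDS = set()

def preprocess(corpus_lines):
    """Step 101 (sketch): normalize the raw Q&A-log corpus to plain text, drop
    stop/sensitive words, and split it into one line per sentence at question
    marks, exclamation marks, semicolons and full stops."""
    lines = []
    for raw in corpus_lines:
        text = str(raw).strip()
        # Split at major punctuation (Chinese full-width and ASCII variants).
        for part in re.split(r"[？！；。?!;.]", text):
            for word in STOP_WORDS | SENSITIVE_WORDS:
                part = part.replace(word, "")
            part = part.strip()
            if part:
                lines.append(part)
    return lines
```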
Step 102: perform word segmentation on the text data to obtain multiple corpus words. In step 102, the word segmentation adopts one or more of the dictionary-based bidirectional maximum matching method, the Viterbi method, the HMM method and the CRF method.
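As one concrete possibility for step 102, the open-source jieba segmenter (whose unknown-word handling is based on an HMM decoded with the Viterbi algorithm) could serve as the segmentation engine; jieba is an assumed example here, since the text only requires one or more of the listed methods.

```python
import jieba  # open-source Chinese word segmenter (example choice, not mandated by the patent)

def segment(lines):
    """Step 102 (sketch): segment each preprocessed line into corpus words."""
    words = []
    for line in lines:
        words.extend(w for w in jieba.cut(line) if w.strip())
    return words
```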
Step 103: filter the corpus words to obtain a dictionary containing multiple keywords. In step 103, the filtering may adopt either or both of the following modes:
Mode one: filter the corpus words by part of speech, retaining nouns, verbs and adjectives;
Mode two: filter the corpus words by frequency, retaining the corpus words whose frequency is greater than a frequency threshold, where the frequency refers to how often, or how many times, a corpus word appears in the corpus data.
In the present embodiment, the corpus words are first filtered by part of speech, retaining only nouns, verbs and adjectives and removing words of other parts of speech; the retained nouns, verbs and adjectives are then filtered by frequency, retaining those whose frequency is greater than the frequency threshold, so that the dictionary contains the nouns, verbs and adjectives whose frequency is greater than the frequency threshold.
In other embodiments of the invention, filtering may be performed only by part of speech or only by frequency, or first by frequency and then by part of speech; all of these fall within the protection scope of the present invention. A sketch of the two-stage filter used in the present embodiment is given below.
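The sketch below illustrates the two-stage filter of step 103 with jieba's part-of-speech tagged segmentation; the frequency threshold value and the retained POS tag prefixes are illustrative assumptions, not values fixed by the patent.

```python
from collections import Counter

import jieba.posseg as pseg  # POS-tagged segmentation from jieba (assumed example toolkit)

FREQ_THRESHOLD = 5                   # assumed tuning value, not given in the patent
KEPT_POS_PREFIXES = ("n", "v", "a")  # noun / verb / adjective tag prefixes in jieba

def build_dictionary(lines, freq_threshold=FREQ_THRESHOLD):
    """Step 103 (sketch): keep nouns, verbs and adjectives, then keep only the
    words whose corpus frequency exceeds the frequency threshold."""
    counts = Counter()
    for line in lines:
        for pair in pseg.cut(line):
            if pair.flag.startswith(KEPT_POS_PREFIXES):
                counts[pair.word] += 1
    return {word for word, count in counts.items() if count > freq_threshold}
```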
Step 104: count, according to the question-and-answer log, the information classes involved in the corpus, calculate the information entropy of each keyword in the dictionary, and delete from the dictionary the keywords whose information entropy is smaller than the information entropy threshold, wherein the information entropy reflects the probability that the keyword appears in each information class.
Preferably, depending on the purpose of the dictionary, the information classes in step 104 may be the intent classes involved in the corpus, where the intent classes include, for example, weather, shopping, work and tourism. Of course, intent classification is only one way of classifying information; when the use of the dictionary differs, the information classes change accordingly.
When the information classes are the intent classes involved in the corpus, calculating the information entropy of the keywords in the dictionary includes: calculating the probability that each keyword in the dictionary appears in each intent class.
The information entropy is calculated as H(X) = -Σ p(x_i) log p(x_i), where H(X) denotes the information entropy of a keyword, p(x_i) denotes the probability that the keyword appears in the i-th intent class, i = 1, 2, ..., n, and n is the number of intent classes.
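Expressed in code, a minimal sketch of step 104 might look as follows; the representation of the question-and-answer log as (intent class, text) pairs, the helper names and the entropy threshold value are assumptions made for the example.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 0.5  # assumed threshold; the patent leaves the value to the user

def keyword_entropy(keyword, qa_log):
    """Compute H(X) = -sum_i p(x_i) * log p(x_i) for one keyword, where p(x_i)
    is estimated from how often the keyword occurs in texts labelled with the
    i-th intent class; qa_log is an iterable of (intent_class, text) pairs."""
    class_counts = Counter(cls for cls, text in qa_log if keyword in text)
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in class_counts.values())

def reduce_dictionary(dictionary, qa_log, entropy_threshold=ENTROPY_THRESHOLD):
    """Step 104 (sketch): delete the keywords whose entropy is below the threshold."""
    return {w for w in dictionary if keyword_entropy(w, qa_log) >= entropy_threshold}
```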
The technical scheme of the embodiment of the present invention is described in detail below with reference to the accompanying drawings.
In this example, keywords that are useless for classification are quickly filtered out according to the information entropy of each keyword across the different intent classes, thereby reducing the dimension of the dictionary. The process specifically includes the following steps:
Step 1: unify the format of the acquired corpus data into a text format to obtain text data, filter out invalid formats, and remove words such as dirty words, sensitive words and stop words; split the processed corpus into lines at major punctuation (question marks, exclamation marks, semicolons and full stops) and save it line by line.
Step 2: use a segmentation engine to perform word segmentation on the text data to obtain multiple corpus words; the dictionary-based bidirectional maximum matching method, the Viterbi method, the HMM method, the CRF method and the like may be used for segmentation.
Step 3: filter the corpus words to obtain a dictionary containing multiple keywords, and perform dimension reduction on the built dictionary. Fig. 2 is a flowchart of the detailed process of the dictionary dimension reduction method of the embodiment of the present invention; as shown in Fig. 2, step 3 specifically includes the following process:
Step 201: filter the corpus words by part of speech, retaining nouns, verbs and adjectives. Words of these parts of speech are relatively likely to be text keywords, while words of other parts of speech are very unlikely to be keywords, so only words of these parts of speech are considered, which improves execution efficiency.
Step 202: filter the remaining corpus words by frequency, retaining the corpus words whose frequency is greater than the frequency threshold and discarding the rest.
Step 203: count, from each question-and-answer log, the intent classes involved in the log, for example intent classes such as weather, shopping, work and tourism.
Step 204: calculate and store the information entropy of each keyword in the dictionary, wherein the information entropy reflects the probability that the keyword appears in each information class.
Step 205: obtain the information entropy of a keyword and judge whether it is smaller than the preset information entropy threshold; if yes, perform step 207, otherwise perform step 206.
Step 206: retain in the dictionary the keywords whose information entropy is greater than the information entropy threshold.
Step 207: delete from the dictionary the keywords whose information entropy is smaller than the information entropy threshold.
In practical applications, when the information entropy equals the information entropy threshold, the corresponding keyword may either be retained in the dictionary or deleted from it.
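Tying the sketches above together, a hypothetical end-to-end run over a toy question-and-answer log could look like this; all helper names are the illustrative functions defined earlier, and the lowered thresholds only compensate for the tiny toy corpus.

```python
# Toy question-and-answer log: (intent class, question text) pairs.
qa_log = [
    ("weather", "明天上海的天气怎么样？"),
    ("weather", "今天会下雨吗？"),
    ("shopping", "我想买一双运动鞋。"),
    ("travel", "去北京旅游有什么推荐？"),
]

lines = preprocess(text for _, text in qa_log)           # step 1: clean and split
corpus_words = segment(lines)                            # step 2: word segmentation
dictionary = build_dictionary(lines, freq_threshold=0)   # steps 201-202 (threshold lowered for the toy corpus)
reduced = reduce_dictionary(dictionary, qa_log, entropy_threshold=0.3)  # steps 203-207
print(len(dictionary), "keywords before,", len(reduced), "after dimension reduction")
```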
In summary, by means of the technical scheme of the embodiment of the present invention, candidate words are quickly filtered using the information entropy of each word across the different classes so as to reduce the dimension of the built dictionary. This solves the problems that prior-art dictionary dimension reduction methods require a considerable amount of time and cannot reach an optimal result through repeated tuning; words that are useless for classification can be filtered out quickly to reduce the dimension of the dictionary, and the reduced dictionary still yields a good classification accuracy.
Method embodiment two
According to an embodiment of the invention, an information classification method is provided. The information classification method according to the embodiment of the present invention includes the dictionary dimension reduction method of method embodiment one, and the information classes in the information classification method include: text description content classification, text sentiment classification, advertisement category classification, or spam filtering classification. That is, in embodiments of the present invention, the dictionary may be dimension-reduced according to different information classes. For example, step 104 of method embodiment one requires calculating the information entropy of each keyword: if the dictionary is one needed in a text description content classification process, then when the information entropy of each keyword in the dictionary is calculated, the information entropy reflects the probability that the keyword appears in each text description content class; if the dictionary is one needed in a text sentiment classification process, the information entropy reflects the probability that the keyword appears in each text sentiment class; if the dictionary is one needed in an advertisement category classification process, the information entropy reflects the probability that the keyword appears in each advertisement category class; and if the dictionary is one needed in a spam filtering classification process, the information entropy reflects the probability that the keyword appears in each spam filtering class.
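For instance, reusing the hypothetical helpers sketched in method embodiment one, switching to a sentiment-style task only changes the labels attached to the log entries; the entropy computation and threshold step remain unchanged (labels and texts below are illustrative).

```python
# Sentiment-style reuse of the same sketches (labels and texts are illustrative).
sentiment_log = [
    ("positive", "这个产品真的很好用，我很满意。"),
    ("negative", "物流太慢了，包装也坏了，很失望。"),
]
sentences = preprocess(text for _, text in sentiment_log)
sentiment_dictionary = build_dictionary(sentences, freq_threshold=0)
reduced = reduce_dictionary(sentiment_dictionary, sentiment_log, entropy_threshold=0.3)
```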
The dictionary dimension reduction method of the embodiment of the present invention has been described in detail in method embodiment one above and is not repeated here.
In summary, by means of the technical scheme of the embodiment of the present invention, information can be classified more quickly and accurately by virtue of the dictionary dimension reduction method.
Device embodiment one
According to an embodiment of the invention, a dictionary dimension reduction device is provided. Fig. 3 is a schematic structural diagram of the dictionary dimension reduction device of the embodiment of the present invention; as shown in Fig. 3, the dictionary dimension reduction device according to the embodiment of the present invention comprises: a preprocessing module 30, a word segmentation module 32, a filtering module 34 and a calculation module 36.
The modules of the embodiment of the present invention are described in detail below.
The preprocessing module 30 is configured to preprocess the corpus obtained from the question-and-answer log to obtain text data. The preprocessing includes: unifying the format of the corpus data into a text format, filtering one or more of dirty words, sensitive words and stop words, and splitting the filtered text data into multiple lines according to punctuation. For example, the punctuation may be a question mark, exclamation mark, semicolon or full stop; that is, the filtered text data may be split into multiple lines at question marks, exclamation marks, semicolons or full stops.
The word segmentation module 32 is configured to perform word segmentation on the text data to obtain multiple corpus words; the word segmentation adopts one or more of the dictionary-based bidirectional maximum matching method, the Viterbi method, the HMM method and the CRF method.
The filtering module 34 is configured to filter the corpus words to obtain a dictionary containing multiple keywords; the filtering may adopt either or both of the following modes:
Mode one: filter the corpus words by part of speech, retaining nouns, verbs and adjectives;
Mode two: filter the corpus words by frequency, retaining the corpus words whose frequency is greater than a frequency threshold, where the frequency refers to how often, or how many times, a corpus word appears in the corpus data.
In the present embodiment, the corpus words are first filtered by part of speech, retaining only nouns, verbs and adjectives and removing words of other parts of speech; the retained nouns, verbs and adjectives are then filtered by frequency, retaining those whose frequency is greater than the frequency threshold, so that the dictionary contains the nouns, verbs and adjectives whose frequency is greater than the frequency threshold.
In other embodiments of the invention, filtering may be performed only by part of speech or only by frequency, or first by frequency and then by part of speech; all of these fall within the protection scope of the present invention.
The calculation module 36 is configured to count, according to the question-and-answer log, the information classes involved in the corpus, calculate the information entropy of each keyword in the dictionary, and delete from the dictionary the keywords whose information entropy is smaller than the information entropy threshold, wherein the information entropy reflects the probability that the keyword appears in each information class.
Preferably, depending on the purpose of the dictionary, the information classes handled by the calculation module 36 may be the intent classes involved in the corpus, where the intent classes include, for example, weather, shopping, work and tourism. Of course, intent classification is only one way of classifying information; when the use of the dictionary differs, the information classes change accordingly.
When the information classes are the intent classes involved in the corpus, the calculation module 36 calculates the information entropy of each keyword in the dictionary by calculating the probability that the keyword appears in each intent class.
The information entropy is calculated as H(X) = -Σ p(x_i) log p(x_i), where H(X) denotes the information entropy of a keyword, p(x_i) denotes the probability that the keyword appears in the i-th intent class, i = 1, 2, ..., n, and n is the number of intent classes.
The specific processing of the modules of the embodiment of the present invention has been described in detail in the corresponding method embodiments and is not repeated here.
In summary, by means of the technical scheme of the embodiment of the present invention, candidate words are quickly filtered using the information entropy of each word across the different classes so as to reduce the dimension of the built dictionary. This solves the problems that prior-art dictionary dimension reduction methods require a considerable amount of time and cannot reach an optimal result through repeated tuning; words that are useless for classification can be filtered out quickly to reduce the dimension of the dictionary, and the reduced dictionary still yields a good classification accuracy.
Device embodiment two
According to an embodiment of the invention, an information classification device is provided which includes the dictionary dimension reduction device of device embodiment one, wherein the information classes handled by the information classification device include: text description content classification, text sentiment classification, advertisement category classification, or spam filtering classification. That is, in embodiments of the present invention, the dictionary may be dimension-reduced according to different information classes. For example, the dictionary dimension reduction device needs to calculate the information entropy of each keyword: if the dictionary is one needed in a text description content classification process, then when the information entropy of each keyword in the dictionary is calculated, the information entropy reflects the probability that the keyword appears in each text description content class; if the dictionary is one needed in a text sentiment classification process, the information entropy reflects the probability that the keyword appears in each text sentiment class; if the dictionary is one needed in an advertisement category classification process, the information entropy reflects the probability that the keyword appears in each advertisement category class; and if the dictionary is one needed in a spam filtering classification process, the information entropy reflects the probability that the keyword appears in each spam filtering class.
The dictionary dimension reduction device of the embodiment of the present invention has been described in detail in device embodiment one above and is not repeated here.
In summary, by means of the technical scheme of the embodiment of the present invention, information can be classified more quickly and accurately with the help of the dictionary dimension reduction device.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
The algorithms and displays provided here are not inherently related to any particular computer, virtual system or other equipment. Various general-purpose systems may also be used with the teaching herein. The structure required to construct such a system is apparent from the above description. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described here may be implemented in various programming languages, and the above description of a specific language is intended to disclose the best mode of carrying out the invention.
The specification provided here describes a large number of specific details. It can be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and help to understand one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together into a single embodiment, figure or description thereof. However, the disclosed method should not be interpreted as reflecting the intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in a client of an embodiment can be adaptively changed and arranged in one or more clients different from that embodiment. The modules in an embodiment can be combined into one module, and can furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or client so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include some features that are included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components according to the embodiments of the present invention. The present invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for carrying out part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third and the like does not indicate any ordering; these words may be interpreted as names.

Claims (16)

1. A dictionary dimension reduction method, characterized by comprising:
preprocessing the corpus obtained from a question-and-answer log to obtain text data;
performing word segmentation on the text data to obtain multiple corpus words;
filtering the corpus words to obtain a dictionary containing multiple keywords; and
counting, according to the question-and-answer log, the information classes involved in the corpus, calculating the information entropy of each keyword in the dictionary, and deleting from the dictionary the keywords whose information entropy is smaller than an information entropy threshold, wherein the information entropy reflects the probability that the keyword appears in each information class.
2. The method of claim 1, characterized in that the preprocessing comprises: unifying the format of the corpus data into a text format, filtering one or more of dirty words, sensitive words and stop words, and splitting the filtered text data into multiple lines according to punctuation.
3. The method of claim 1, characterized in that the word segmentation adopts one or more of the dictionary-based bidirectional maximum matching method, the Viterbi method, the HMM method and the CRF method.
4. The method of claim 1, characterized in that the filtering adopts either or both of the following modes:
filtering the corpus words by part of speech, retaining nouns, verbs and adjectives;
filtering the corpus words by frequency, retaining the corpus words whose frequency is greater than a frequency threshold.
5. The method of claim 1, characterized in that the information classes comprise the intent classes involved in the corpus; and calculating the information entropy of each keyword in the dictionary comprises: calculating the probability that each keyword in the dictionary appears in each intent class.
6. The method of claim 1, characterized in that the information entropy is calculated as H(X) = -Σ p(x_i) log p(x_i), where H(X) denotes the information entropy of a keyword, p(x_i) denotes the probability that the keyword appears in the i-th intent class, i = 1, 2, ..., n, and n is the number of intent classes.
7. An information classification method, characterized by comprising the dictionary dimension reduction method of any one of claims 1-6.
8. The method of claim 7, characterized in that the information classes comprise: text description content classification, text sentiment classification, advertisement category classification, or spam filtering classification.
9. A dictionary dimension reduction device, characterized by comprising:
a preprocessing module, configured to preprocess the corpus obtained from a question-and-answer log to obtain text data;
a word segmentation module, configured to perform word segmentation on the text data to obtain multiple corpus words;
a filtering module, configured to filter the corpus words to obtain a dictionary containing multiple keywords; and
a calculation module, configured to count, according to the question-and-answer log, the information classes involved in the corpus, calculate the information entropy of each keyword in the dictionary, and delete from the dictionary the keywords whose information entropy is smaller than an information entropy threshold, wherein the information entropy reflects the probability that the keyword appears in each information class.
10. The device of claim 9, characterized in that the preprocessing module is specifically configured to: unify the format of the corpus data into a text format, filter one or more of dirty words, sensitive words and stop words, and split the filtered text data into multiple lines according to punctuation.
11. The device of claim 9, characterized in that the word segmentation module is specifically configured to: perform word segmentation using one or more of the dictionary-based bidirectional maximum matching method, the Viterbi method, the HMM method and the CRF method.
12. The device of claim 9, characterized in that the filtering module is specifically configured to:
adopt either or both of the following modes for filtering:
filtering the corpus words by part of speech, retaining nouns, verbs and adjectives;
filtering the corpus words by frequency, retaining the corpus words whose frequency is greater than a frequency threshold.
13. The device of claim 9, characterized in that the information classes comprise the intent classes involved in the corpus;
and the calculation module is specifically configured to: calculate the probability that each keyword in the dictionary appears in each intent class.
14. The device of claim 9, characterized in that the information entropy is calculated as H(X) = -Σ p(x_i) log p(x_i), where H(X) denotes the information entropy of a keyword, p(x_i) denotes the probability that the keyword appears in the i-th intent class, i = 1, 2, ..., n, and n is the number of intent classes.
15. An information classification device, characterized by comprising the dictionary dimension reduction device of any one of claims 9-14.
16. The device of claim 15, characterized in that the information classes comprise: text description content classification, text sentiment classification, advertisement category classification, or spam filtering classification.
CN201510874528.6A 2015-12-02 2015-12-02 Dictionary dimension reducing method and device and information classifying method and device Pending CN105512104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510874528.6A CN105512104A (en) 2015-12-02 2015-12-02 Dictionary dimension reducing method and device and information classifying method and device


Publications (1)

Publication Number Publication Date
CN105512104A true CN105512104A (en) 2016-04-20

Family

ID=55720097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510874528.6A Pending CN105512104A (en) 2015-12-02 2015-12-02 Dictionary dimension reducing method and device and information classifying method and device

Country Status (1)

Country Link
CN (1) CN105512104A (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101630312A (en) * 2009-08-19 2010-01-20 腾讯科技(深圳)有限公司 Clustering method for question sentences in question-and-answer platform and system thereof
CN101777042A (en) * 2010-01-21 2010-07-14 西南科技大学 Neural network and tag library-based statement similarity algorithm
CN102662976A (en) * 2012-03-12 2012-09-12 浙江工业大学 Text feature weighting method based on supervision
CN103902570A (en) * 2012-12-27 2014-07-02 腾讯科技(深圳)有限公司 Text classification feature extraction method, classification method and device
CN104063387A (en) * 2013-03-19 2014-09-24 三星电子(中国)研发中心 Device and method abstracting keywords in text
CN103425777A (en) * 2013-08-15 2013-12-04 北京大学 Intelligent short message classification and searching method based on improved Bayesian classification
CN104050256A (en) * 2014-06-13 2014-09-17 西安蒜泥电子科技有限责任公司 Initiative study-based questioning and answering method and questioning and answering system adopting initiative study-based questioning and answering method
CN104281653A (en) * 2014-09-16 2015-01-14 南京弘数信息科技有限公司 Viewpoint mining method for ten million microblog texts
CN104361010A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Automatic classification method for correcting news classification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RADA MIHALCEA et al.: "TextRank: Bringing Order into Texts", UNT Scholarly Works *
陈涛 et al.: "文本分类中的特征降维方法综述" [A survey of feature dimension reduction methods in text classification], 《情报学报》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202349B (en) * 2016-06-29 2020-08-21 新华三技术有限公司 Webpage classification dictionary generation method and device
CN106202349A (en) * 2016-06-29 2016-12-07 杭州华三通信技术有限公司 Web page classifying dictionary creation method and device
CN106294689A (en) * 2016-08-05 2017-01-04 浪潮电子信息产业股份有限公司 A kind of method and apparatus selecting based on text category feature to carry out dimensionality reduction
CN106294689B (en) * 2016-08-05 2018-09-25 浪潮电子信息产业股份有限公司 A kind of method and apparatus for selecting to carry out dimensionality reduction based on text category feature
CN108268431A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 The method and apparatus of paragraph vectorization
CN108268431B (en) * 2016-12-30 2019-12-03 北京国双科技有限公司 The method and apparatus of paragraph vectorization
CN107943791A (en) * 2017-11-24 2018-04-20 北京奇虎科技有限公司 A kind of recognition methods of refuse messages, device and mobile terminal
CN108230037A (en) * 2018-01-12 2018-06-29 北京深极智能科技有限公司 Advertisement base method for building up, ad data recognition methods and storage medium
CN108230037B (en) * 2018-01-12 2022-10-11 北京字节跳动网络技术有限公司 Advertisement library establishing method, advertisement data identification method and storage medium
CN112241748A (en) * 2019-07-16 2021-01-19 广州汽车集团股份有限公司 Data dimension reduction method and device based on multi-source information entropy difference
CN112241748B (en) * 2019-07-16 2024-06-14 广州汽车集团股份有限公司 Data dimension reduction method and device based on multi-source information entropy difference
CN112507088A (en) * 2019-09-16 2021-03-16 顺丰科技有限公司 Text processing method, device, server and storage medium
CN112925903A (en) * 2019-12-06 2021-06-08 农业农村部信息中心 Text classification method and device, electronic equipment and medium
CN112925903B (en) * 2019-12-06 2024-03-29 农业农村部信息中心 Text classification method, device, electronic equipment and medium
CN114357982A (en) * 2021-12-30 2022-04-15 有米科技股份有限公司 Data processing method and device for constructing domain dictionary

Similar Documents

Publication Publication Date Title
CN105512104A (en) Dictionary dimension reducing method and device and information classifying method and device
CN105389307A (en) Statement intention category identification method and apparatus
CN108304468B (en) Text classification method and text classification device
CN110162750B (en) Text similarity detection method, electronic device and computer readable storage medium
CN108038119A (en) Utilize the method, apparatus and storage medium of new word discovery investment target
CN103838798B (en) Page classifications system and page classifications method
CN104504150A (en) News public opinion monitoring system
CN104598532A (en) Information processing method and device
CN104537097A (en) Microblog public opinion monitoring system
CN108733675B (en) Emotion evaluation method and device based on large amount of sample data
CN105589941A (en) Emotional information detection method and apparatus for web text
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN104866478A (en) Detection recognition method and device of malicious text
CN104361037A (en) Microblog classifying method and device
de Oliveira et al. FS-NER: A lightweight filter-stream approach to named entity recognition on twitter data
CN107861945A (en) Finance data analysis method, application server and computer-readable recording medium
CN106569989A (en) De-weighting method and apparatus for short text
CN107862051A (en) A kind of file classifying method, system and a kind of document classification equipment
CN110990587B (en) Enterprise relation discovery method and system based on topic model
CN112287240A (en) Case microblog evaluation object extraction method and device based on double-embedded multilayer convolutional neural network
CN106933798B (en) Information analysis method and device
CN110704611B (en) Illegal text recognition method and device based on feature de-interleaving
CN107045497A (en) A kind of quick newsletter archive content sentiment analysis system and method
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
CN105095386A (en) Device and method for determining web page quality

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2016-04-20)