CN109740152A

CN109740152A - Method, apparatus, storage medium, and computer device for determining a text category

Info

Publication number: CN109740152A
Application number: CN201811592736.7A
Authority: CN
Inventors: 张长旺; 张纪红
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2019-05-10
Anticipated expiration: 2038-12-25
Also published as: CN109740152B

Abstract

This application relates to a method, an apparatus, a computer-readable storage medium, and a computer device for determining a text category. The method comprises: extracting keywords of a text to be processed, and determining the weight of each keyword; obtaining semantic description information corresponding to each keyword; determining, according to each piece of semantic description information, a first degree of correlation between each keyword and each candidate category; determining, according to the weight of each keyword and each first degree of correlation, a second degree of correlation between the text to be processed and each candidate category; and determining, according to each second degree of correlation, the category to which the text to be processed belongs from among the candidate categories. The solution provided by this application can save labor costs and removes the dependence of the quality of the determined category on the quality of manual labeling.

Description

Method, apparatus, storage medium, and computer device for determining a text category

Technical field

This application relates to the field of computer technology, and in particular to a method, an apparatus, a computer-readable storage medium, and a computer device for determining a text category.

Background

Text category labeling refers to labeling a text with one or more categories in a category system. It is widely used in many business scenarios such as advertising, recommendation, and search. Determining the category to which a text belongs is an important step in text category labeling.

In the traditional way of determining a text category, the categories of a number of texts are first labeled manually to obtain training samples; a machine learning model such as a neural network is then trained on the training samples to obtain a mapping model; and the text to be processed is input into the mapping model, which determines its category. However, obtaining the training samples by manual labeling consumes a large amount of labor. Moreover, because the mapping model is trained on manually labeled training samples, the quality of the category determined for the text to be processed depends heavily on the quality of the manual labeling.

Summary of the invention

Based on this, in view of the technical problems that the traditional approach consumes a large amount of labor and that the quality of the category determined for the text to be processed depends heavily on the quality of manual labeling, it is necessary to provide a method, an apparatus, a computer-readable storage medium, and a computer device for determining a text category.

A method for determining a text category, comprising:

extracting keywords of a text to be processed, and determining the weight of each keyword;

obtaining semantic description information corresponding to each keyword;

determining, according to each piece of semantic description information, a first degree of correlation between each keyword and each candidate category;

determining, according to the weight of each keyword and each first degree of correlation, a second degree of correlation between the text to be processed and each candidate category; and

determining, according to each second degree of correlation, the category to which the text to be processed belongs from among the candidate categories.

An apparatus for determining a text category, comprising:

a keyword processing module, configured to extract keywords of a text to be processed and determine the weight of each keyword;

a semantic description information obtaining module, configured to obtain semantic description information corresponding to each keyword;

a first correlation determining module, configured to determine, according to each piece of semantic description information, a first degree of correlation between each keyword and each candidate category;

a second correlation determining module, configured to determine, according to the weight of each keyword and each first degree of correlation, a second degree of correlation between the text to be processed and each candidate category; and

a text category determining module, configured to determine, according to each second degree of correlation, the category to which the text to be processed belongs from among the candidate categories.

A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method for determining a text category described above.

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method for determining a text category described above.

The above method, apparatus, computer-readable storage medium, and computer device for determining a text category extract the keywords of the text to be processed and obtain the weight of each keyword, then obtain the semantic description information corresponding to each keyword, then determine the first degree of correlation between each keyword and each candidate category according to the semantic description information, then determine the second degree of correlation between the text to be processed and each candidate category according to the weight of each keyword and each first degree of correlation, and finally determine the category to which the text to be processed belongs from among the candidate categories according to each second degree of correlation. In this way, even when no text with a known category is available, the computer device can determine the category of any text fully automatically. This eliminates the manual category-labeling step, saves labor costs, and removes the dependence of the quality of the determined category on the quality of manual labeling.

Brief description of the drawings

Fig. 1 is a diagram of the application environment of the method for determining a text category in one embodiment;

Fig. 2 is a flowchart of the method for determining a text category in one embodiment;

Fig. 3 is a schematic diagram of the process of determining the first degree of correlation between keywords and candidate categories in one embodiment;

Fig. 4 is a schematic diagram of the process of determining the second degree of correlation between a text and candidate categories in one embodiment;

Fig. 5 is a schematic diagram of an interface for displaying and querying the category labeling results of texts in one embodiment;

Fig. 6 is a flowchart of the way the first proportion threshold is determined in one embodiment;

Fig. 7 is a schematic diagram of the process of determining the number of remaining words while determining the first proportion threshold in one embodiment;

Fig. 8 is a schematic diagram of the process of determining the first degree of correlation between keywords and candidate categories in one embodiment;

Fig. 9 is a schematic diagram of the process of determining the first degree of correlation between keywords and candidate categories in one embodiment;

Fig. 10 is a schematic diagram of an interface for manually entering association knowledge in one embodiment;

Fig. 11 is a schematic diagram of an interface for manually entering category priority information in one embodiment;

Fig. 12 is a flowchart of the method for determining a text category in one embodiment;

Fig. 13 is a structural block diagram of the apparatus for determining a text category in one embodiment;

Fig. 14 is a structural block diagram of a computer device in one embodiment;

Fig. 15 is a structural block diagram of a computer device in one embodiment.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the application, not to limit it.

It should be noted that the terms "first", "second", and the like used in this application distinguish similar objects by name, but the objects themselves are not limited by these terms. It should be understood that these terms are interchangeable where appropriate without departing from the scope of this application. For example, a "first segmented word" could be described as a "second segmented word", and similarly a "second segmented word" could be described as a "first segmented word".

The method for determining a text category provided by the embodiments of this application can be applied in the application environment shown in Fig. 1. The application environment may involve a terminal 110 and a server 120, which are connected through a network.

Specifically, the terminal 110 obtains the text to be processed and sends it to the server 120. The server 120 extracts the keywords of the text to be processed and obtains the weight of each keyword, then obtains the semantic description information corresponding to each keyword, then determines the first degree of correlation between each keyword and each candidate category according to the semantic description information, then determines the second degree of correlation between the text to be processed and each candidate category according to the weight of each keyword and each first degree of correlation, and finally determines the category to which the text to be processed belongs from among the candidate categories according to each second degree of correlation.

In other application environments, only the server 120 may be involved, without the terminal 110; in that case the server 120 independently performs the series of steps from obtaining the text to be processed to determining its category from among the candidate categories. Alternatively, only the terminal 110 may be involved, without the server 120; in that case the terminal 110 independently performs the same series of steps.

The terminal 110 may include at least one of a mobile phone, a tablet computer, a laptop, a desktop computer, a personal digital assistant, a wearable device, and the like, but is not limited thereto. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers.

In one embodiment, as shown in Fig. 2, a method for determining a text category is provided. The method is described here as applied to a computer device (such as the terminal 110 or the server 120 in Fig. 1 above). The method may include the following steps S202 to S210.

S202, extract the keywords of the text to be processed, and obtain the weight of each keyword.

The text to be processed is a text whose category needs to be determined. It may be a short text, that is, a text of short length, for example no more than 160 characters; common short texts include microblog posts, article titles, opinion comments, SMS messages, and literature abstracts, but are not limited thereto. The text to be processed may also be a long text, that is, a text that is longer than a short text.

A keyword is a representative word in the text to be processed and can be used to characterize the main idea of the text. Specifically, keyword extraction can be performed on the text to be processed to obtain its keywords. Keyword extraction can be implemented with any applicable method, such as the TextRank algorithm, the RAKE algorithm, or a topic-model algorithm, which is not specifically limited here.

The weight of a keyword can be used to characterize the importance of the keyword to the text to be processed. The weight of a keyword can be determined according to the TF-IDF value of the keyword, where the TF-IDF value is the term frequency (TF) of the keyword in the text to be processed multiplied by the inverse document frequency (IDF) of the keyword.

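The term frequency of a keyword in the text to be processed is the number of times the keyword occurs in the text to be processed. The inverse document frequency of a keyword can be determined from the number of all objects in a target corpus and the number of those objects that contain the keyword.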

In one embodiment, the target corpus may be the corpus corresponding to a web search service. Accordingly, the number of objects in the target corpus that contain the keyword may be the total number of search results obtained by searching for the keyword through the web search service, and the number of all objects in the target corpus may be set to a predetermined value, for example 1+100000000.

It should be noted that, after the web search service is called to search for a keyword, the number of all search results obtained for the keyword and the semantic description information corresponding to the keyword can be obtained together and associated with each other. Accordingly, when the semantic description information corresponding to a keyword is obtained, the number of all search results for the keyword is obtained with it; this parameter can then be used directly when calculating the inverse document frequency of the keyword, without a separate call to the web search service.

In one embodiment, determining the weight of each keyword according to the TF-IDF value of each keyword of the text to be processed can be implemented as follows: the computer device determines the original weight of each keyword according to its TF-IDF value, and then normalizes the original weight of each keyword to obtain the weight of each keyword. Normalizing the original weight of a keyword may consist of dividing it by the sum of the original weights of all keywords of the text to be processed. In addition, the original weight of a keyword may be the TF-IDF value of the keyword itself, or the product of the TF-IDF value of the keyword and the word length of the keyword.
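The weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula of this application: the IDF form `log(N / (1 + n))` is a common textbook variant chosen here for illustration, and the function name `keyword_weights`, its parameters, and the sample counts are hypothetical.

```python
import math
from collections import Counter

# Number of all objects in the target corpus, using the example value from the text.
TOTAL_OBJECTS = 1 + 100_000_000


def keyword_weights(tokens, keywords, doc_counts, use_word_length=False):
    """Return normalized keyword weights that sum to 1.

    tokens: segmented words of the text to be processed
    keywords: keywords extracted from that text
    doc_counts: keyword -> number of corpus objects containing it (hypothetical input)
    """
    tf = Counter(tokens)
    raw = {}
    for kw in keywords:
        # Assumed IDF form; the exact formula is not fixed by the surrounding text.
        idf = math.log(TOTAL_OBJECTS / (1 + doc_counts.get(kw, 0)))
        weight = tf[kw] * idf
        if use_word_length:
            # Optional variant: TF-IDF value multiplied by the word length.
            weight *= len(kw)
        raw[kw] = weight
    total = sum(raw.values())
    if total == 0:
        # Fall back to uniform weights when every original weight is zero.
        return {kw: 1 / len(raw) for kw in raw}
    # Normalize: divide each original weight by the sum of all original weights.
    return {kw: w / total for kw, w in raw.items()}


weights = keyword_weights(
    tokens=["banker", "big land", "trap", "big land"],
    keywords=["banker", "big land", "trap"],
    doc_counts={"banker": 1_000_000, "big land": 500_000, "trap": 2_000_000},
)
```

Passing `use_word_length=True` switches to the variant in which the original weight is the product of the TF-IDF value and the word length.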

For example, suppose the text to be processed is "trap that banker widens land". The extracted keywords of the text to be processed and their corresponding weights may be "banker: 0.3; big land: 0.5; trap: 0.2", where keywords are separated by ";", the value after ":" is the weight of the keyword, and the weights of the keywords of the text to be processed sum to 1.

S204, obtain semantic description information corresponding to each keyword.

Semantic description information is information that helps in understanding the meaning expressed by a keyword. The semantic description information may take the form of a text file.

In one embodiment, the semantic description information corresponding to a keyword can be determined from information compiled by relevant personnel to describe the keyword (hereinafter referred to as expert description information); the relevant personnel may be experts in the relevant field. Specifically, experts can compile expert description information for each candidate keyword, and an expert knowledge base can then be built from the candidate keywords, the pieces of expert description information, and the matching relationships between them. Accordingly, when the semantic description information of a keyword needs to be obtained, the candidate keyword corresponding to the keyword is looked up in the expert knowledge base, and the semantic description information of the keyword may include the expert description information matched to the candidate keyword that is found.
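A minimal sketch of such a lookup is given below, assuming the expert knowledge base is held as an in-memory mapping from candidate keyword to expert description information. The name `EXPERT_KB`, the function `get_semantic_description`, and the description strings are invented for illustration only.

```python
# Hypothetical expert knowledge base: candidate keyword -> expert description information.
# The description strings are invented sample data.
EXPERT_KB = {
    "banker": "A party that holds a large position and can influence a stock's price.",
    "trap": "A situation set up to mislead investors into a losing trade.",
}


def get_semantic_description(keyword, knowledge_base=EXPERT_KB):
    """Return the expert description matched to the keyword, or None if there is no match."""
    return knowledge_base.get(keyword)


descriptions = {
    kw: get_semantic_description(kw) for kw in ["banker", "trap", "big land"]
}
```

A keyword with no matching candidate keyword simply yields `None` here; how such keywords are handled is left open by this sketch.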

S206, determine the first degree of correlation between each keyword and each candidate category according to the semantic description information.

The first degree of correlation between a keyword and a candidate category is a metric that can be used to measure the degree of matching between the keyword and the candidate category. Its value range may be [0, +1]. The larger the first degree of correlation, the higher the degree of matching between the keyword and the candidate category; conversely, the smaller the first degree of correlation, the lower the degree of matching.

In this embodiment, there is more than one candidate category, and the computer device determines the first degree of correlation between each keyword and each candidate category according to the semantic description information corresponding to each keyword.

For example, as shown in Fig. 3, keyword extraction on the text to be processed LT1 yields three keywords: keyword Kw1, keyword Kw2, and keyword Kw3. Keyword Kw1 corresponds to semantic description information Sd1, keyword Kw2 to semantic description information Sd2, and keyword Kw3 to semantic description information Sd3, and there are three candidate categories: candidate category C1, candidate category C2, and candidate category C3.

Accordingly, the computer device determines, according to semantic description information Sd1, the first degree of correlation between keyword Kw1 and each of candidate categories C1, C2, and C3. Likewise, it determines, according to semantic description information Sd2, the first degree of correlation between keyword Kw2 and each of candidate categories C1, C2, and C3, and, according to semantic description information Sd3, the first degree of correlation between keyword Kw3 and each of candidate categories C1, C2, and C3.

In one embodiment, the computer device can determine the first degree of correlation between each keyword and each candidate category according to the semantic description information corresponding to each keyword and the category description information of each candidate category. For example, the computer device determines the first degree of correlation between keyword Kw1 and candidate category C1 according to semantic description information Sd1 and the category description information of candidate category C1. The category description information of a candidate category is information that can be used to reflect the characteristics of the candidate category.
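The keyword-by-category structure of this step can be sketched as below. The scoring function is a stand-in: the passage above does not specify how a semantic description is compared with a category description, so a simple Jaccard token overlap (which happens to fall in [0, 1]) is assumed purely for illustration. All names and sample strings are hypothetical.

```python
def token_overlap(description, category_description):
    """Stand-in score in [0, 1]: Jaccard overlap of lowercase whitespace tokens."""
    a = set(description.lower().split())
    b = set(category_description.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def first_correlations(semantic_descriptions, category_descriptions, score=token_overlap):
    """Return {keyword: {category: first degree of correlation}}."""
    return {
        kw: {cat: score(sd, cd) for cat, cd in category_descriptions.items()}
        for kw, sd in semantic_descriptions.items()
    }


# Invented sample data standing in for Sd1, Sd2 and the category descriptions.
semantic_descriptions = {"Kw1": "stock market price", "Kw2": "football match score"}
category_descriptions = {"C1": "stock market finance", "C2": "football sports match"}
first = first_correlations(semantic_descriptions, category_descriptions)
```

Any other scoring function with the same signature can be passed as `score` without changing the surrounding structure.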

S208, determine the second degree of correlation between the text to be processed and each candidate category according to the weight of each keyword and each first degree of correlation.

The second degree of correlation between the text to be processed and a candidate category is a metric that can be used to measure the degree of matching between the text to be processed and the candidate category. Its value range may be [0, +1]. The larger the second degree of correlation, the higher the degree of matching between the text to be processed and the candidate category; conversely, the smaller the second degree of correlation, the lower the degree of matching.

In this embodiment, for each candidate category, the computer device computes a weighted sum of the first degrees of correlation between the keywords of the text to be processed and the candidate category, using the weight of each keyword, to obtain the second degree of correlation between the text to be processed and the candidate category.

Continuing the foregoing example, as shown in Fig. 4, the computer device computes a weighted sum from the weight of keyword Kw1 and the first degree of correlation between Kw1 and candidate category C1, the weight of keyword Kw2 and the first degree of correlation between Kw2 and C1, and the weight of keyword Kw3 and the first degree of correlation between Kw3 and C1, to obtain the second degree of correlation between the text to be processed LT1 and candidate category C1. In the same way, it uses the weights of keywords Kw1, Kw2, and Kw3 and their first degrees of correlation with candidate category C2 to obtain the second degree of correlation between LT1 and C2, and the weights of Kw1, Kw2, and Kw3 and their first degrees of correlation with candidate category C3 to obtain the second degree of correlation between LT1 and C3.
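The weighted summation in this example can be sketched as follows. The function name `second_correlations` and all numeric values are hypothetical; only the weighted-sum structure comes from the text.

```python
def second_correlations(weights, first):
    """For each candidate category, sum weight * first degree of correlation over keywords."""
    categories = {cat for per_kw in first.values() for cat in per_kw}
    return {
        cat: sum(w * first.get(kw, {}).get(cat, 0.0) for kw, w in weights.items())
        for cat in categories
    }


# Invented sample values for keywords Kw1..Kw3 and candidate categories C1..C3.
weights = {"Kw1": 0.3, "Kw2": 0.5, "Kw3": 0.2}
first = {
    "Kw1": {"C1": 0.9, "C2": 0.1, "C3": 0.0},
    "Kw2": {"C1": 0.2, "C2": 0.8, "C3": 0.1},
    "Kw3": {"C1": 0.5, "C2": 0.5, "C3": 0.4},
}
second = second_correlations(weights, first)
```

Because the keyword weights sum to 1 and each first degree of correlation lies in [0, 1], each second degree of correlation produced this way also lies in [0, 1].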

S210, determine the category to which the text to be processed belongs from among the candidate categories according to each second degree of correlation.

The category to which the text to be processed belongs may include those candidate categories whose second degree of correlation with the text to be processed satisfies a correlation screening condition. The correlation screening condition can be set according to actual needs.

Specifically, the correlation screening condition may include: the second degree of correlation with the text to be processed is equal to or greater than a correlation threshold, which is predetermined according to actual needs. The correlation screening condition may also include: the second degree of correlation with the text to be processed is among a predetermined number of the largest second degrees of correlation. That is, the second degrees of correlation are sorted in descending order, and the category to which the text to be processed belongs may include the candidate categories corresponding to the predetermined number of second degrees of correlation at the front of the ordering; the predetermined number can be set to any positive integer according to actual needs.
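Both screening conditions can be sketched in a few lines. The function names and the sample values are hypothetical.

```python
def select_by_threshold(second, threshold):
    """Keep candidate categories whose second degree of correlation is >= threshold."""
    return [cat for cat, score in second.items() if score >= threshold]


def select_top_k(second, k):
    """Keep the k candidate categories with the largest second degrees of correlation."""
    ranked = sorted(second.items(), key=lambda item: item[1], reverse=True)
    return [cat for cat, _ in ranked[:k]]


# Invented sample second degrees of correlation.
second = {"C1": 0.47, "C2": 0.53, "C3": 0.13}
by_threshold = select_by_threshold(second, 0.4)
top_one = select_top_k(second, 1)
```

The threshold variant can return zero or many categories, while the top-k variant returns at most `k`.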

It should be noted that a category system (a system formed by dividing a target field according to rules) can be preset, and the category system includes more than one candidate category. Accordingly, the category to which the text to be processed belongs can be determined, according to each second degree of correlation, from among the candidate categories included in the category system.

In one embodiment, after step S210, the method may further include: performing category labeling on the text to be processed according to the category to which it belongs. Category labeling may consist of outputting a category labeling result for the text to be processed; the category labeling result may include the text to be processed, the category to which it belongs, and the degree of correlation between the text and that category. This degree of correlation can be determined from the second degree of correlation between the text to be processed and the category to which it belongs; for example, it can be that second degree of correlation itself.
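One possible shape for a category labeling result is sketched below. The class name `CategoryAnnotation`, its field names, and the sample values are hypothetical; the three fields mirror the three items listed above.

```python
from dataclasses import dataclass


@dataclass
class CategoryAnnotation:
    text: str           # the text to be processed
    category: str       # a category to which the text belongs
    correlation: float  # here, the second degree of correlation itself


def annotate(text, second, selected):
    """Build one labeling result per selected category."""
    return [CategoryAnnotation(text, cat, second[cat]) for cat in selected]


results = annotate("LT1", {"C1": 0.47, "C2": 0.53, "C3": 0.13}, ["C2", "C1"])
```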

Accordingly, in practical application scenarios, an automatic labeling system can be built on the method for determining a text category, and this system can be used to perform category labeling on texts. The automatic labeling system can also provide a service for displaying and querying the category labeling results of texts. The display and query interface may be as shown in Fig. 5. In the interface, the user can click the "previous page" and "next page" controls to browse, page by page, the texts whose category labeling has been completed. The user can also enter a text or text ID in the input box 500 and then click the "query" control to query the category ID and category name of the category to which the text belongs, together with its degree of correlation with the text. The user can also enter a category name or category ID in the input box 500 to query the texts under the corresponding category.

It should be noted that the traditional approach (first manually labeling the categories of a number of texts to obtain training samples, then training a machine learning model such as a neural network on the training samples to obtain a mapping model, and then inputting the text to be processed into the mapping model, which determines its category) has the following defects and therefore cannot meet the need to determine the categories of massive numbers of texts in many real business scenarios such as advertising, recommendation, and search.

(1) The process of manually labeling training samples consumes a great deal of labor. Specifically, the number of training samples that must be labeled manually grows linearly with the number of candidate categories in the category system. For example, assuming that supporting one candidate category requires manually labeling 10,000 training samples, supporting a category system containing 1,000 candidate categories requires manually labeling 10,000,000 training samples, which consumes enormous manpower and material resources.

(2) The quality of the category determined for the text to be processed depends heavily on the quality of manual labeling. When categories are determined through a mapping model, determining them with high quality requires that the manual labels of the training samples be highly accurate and that the proportions of training samples across the candidate categories in the category system match those of the true overall population. However, training samples produced by manual labeling (especially large-scale manual labeling) can hardly meet these requirements, and in practice the quality of manually labeled training samples is generally poor, so the traditional approach cannot determine the category of the text to be processed with high quality.

(3) New knowledge cannot be learned automatically, and the mapping model cannot be updated automatically. The traditional approach learns the knowledge for mapping texts to categories from manually labeled training samples to obtain the mapping model; without new training samples it cannot learn new knowledge automatically, and the mapping model cannot be updated automatically. New words and trending words, however, keep emerging on the network. In the traditional approach, unless manual labeling is introduced again, the mapping model cannot understand texts to be processed that contain these new words and trending words, and therefore cannot accurately determine their categories.

(4) Introducing manually accumulated professional knowledge into the process of determining the category of the text to be processed is not supported. In common real business scenarios, relevant personnel have a great deal of professional knowledge about determining the categories of texts, such as which keywords are related to which categories and which of the candidate categories in the category system should be given priority. Introducing this accumulated professional knowledge into the process of determining the category of the text to be processed can improve the quality of the determined category. In the traditional approach, however, the text to be processed is input into the mapping model and the mapping model outputs a result characterizing the category of the text; the intermediate processing logic is abstract and hard for people to understand, so manual intervention is impossible and manually accumulated professional knowledge cannot be introduced into the process.

The method for determining a text category provided by the embodiments of this application extracts the keywords of the text to be processed and obtains the weight of each keyword, then obtains the semantic description information corresponding to each keyword, then determines the first degree of correlation between each keyword and each candidate category according to the semantic description information, then determines the second degree of correlation between the text to be processed and each candidate category according to the weight of each keyword and each first degree of correlation, and finally determines the category to which the text to be processed belongs from among the candidate categories according to each second degree of correlation. In this way, even when no text with a known category is available, the computer device can determine the category of any text fully automatically. This eliminates the manual category-labeling step, saves labor costs, and removes the dependence of the quality of the determined category on the quality of manual labeling. In addition, the intermediate logic of the process from obtaining the text to be processed to determining its category is understandable to people, which makes it possible to introduce manually accumulated professional knowledge into the process of determining the category of the text to be processed.

In one embodiment, the step of extracting the keywords of the text to be processed may include the following steps: performing word segmentation processing on the text to be processed to obtain multiple first participles of the text to be processed; rejecting, from the first participles, those that belong to the goal filtering dictionary, to obtain one or more second participles, the second participles being the first participles remaining after the rejection; and obtaining the keywords of the text to be processed according to the second participles.

Word segmentation processing is used to split the text to be processed into several words. It can be implemented with any feasible word segmentation method, such as conditional random field (Conditional Random Field, CRF) segmentation, Jieba segmentation, NLPIR segmentation, LTP (Language Technology Platform) segmentation, or THULAC (THU Lexical Analyzer for Chinese) segmentation.

Conditional random field segmentation, based on conditional random field theory, segments the text to be processed by jointly considering the frequency with which words occur in the text and the contextual relations between words. This method performs well on ambiguous words and new words.

It should be noted that the following optimization strategy may be used when performing word segmentation processing on the text to be processed: the text content inside book-title marks may be treated as a whole as one participle, without being split. For example, if the text to be processed is "The metaphorical meaning of the tiger in 《Teenage Pi's Fantastic Drift》" (the Chinese title of "Life of Pi"), then "Teenage Pi's Fantastic Drift" as a whole is treated as one participle, without splitting out words such as "Teenage Pi" and "fantastic drift".

When the text to be processed is a short text, at least one of the following two optimization strategies may also be used. First, the text content inside brackets of a predetermined form may be treated as a whole as one participle, without being split; brackets of a predetermined form include at least one of round brackets, square brackets and braces. Second, the text content before the first colon near the head of the text to be processed may be treated as a whole as one participle, without being split; for example, if the text to be processed is "LPL war communique: difficult It allows one to chase after two WE2:1 to turn over TOP", then "LPL war communique" as a whole is treated as one participle.

Treating the text content in the above situations as a whole as one participle, without splitting it, effectively preserves the actual meaning of that text content and helps improve the accuracy of the determined keywords.
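The three protection strategies above can be sketched as a pre-pass that collects the spans to keep whole before segmentation runs. This is a minimal illustration only: the function name `protected_spans`, the regular expressions, and the `short_text` flag are assumptions made for the example and are not part of the described method.

```python
import re

# Hypothetical patterns for spans that should stay whole during segmentation.
BOOK_TITLE = re.compile(r"《[^》]+》")
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}")


def protected_spans(text, short_text=True):
    """Return substrings that should each be kept as a single participle."""
    # Content inside book-title marks is always protected.
    spans = [m.group(0)[1:-1] for m in BOOK_TITLE.finditer(text)]
    if short_text:
        # Content inside round brackets, square brackets, or braces.
        spans += [m.group(0)[1:-1] for m in BRACKETED.finditer(text)]
        # Content before the first colon (ASCII or full-width).
        parts = re.split(r"[:：]", text, maxsplit=1)
        if len(parts) == 2 and parts[0].strip():
            spans.append(parts[0].strip())
    return spans
```

A segmenter would then emit each returned span as one participle and segment only the remaining text.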

A filtering dictionary is a database for recording filter words. Filtering dictionaries may correspond one-to-one with data sources. The data source to which a text belongs is the origin of that text. For example, the subject texts of the public accounts that a user follows may be assigned to one data source, the title texts of the articles that the user has read to another data source, the description texts of the commodities that the user has bought to another data source, the description texts of the videos that the user has watched to another data source, and so on.

The filter words recorded in a filtering dictionary are the words that need to be rejected from the first participles of the text to be processed. They may include at least one of words that have no actual meaning for the corresponding data source and words that are common in the texts covered by that data source. It can be understood that a word common to the texts covered by a data source can hardly characterize any single text in that data source, so it is difficult to distinguish between the texts of the data source on the basis of that word. Such a word can therefore be taken as a filter word of the data source and recorded in the filtering dictionary corresponding to the data source.

The goal filtering dictionary is the filtering dictionary corresponding to the target data source, and the target data source is the data source to which the text to be processed belongs. In a specific implementation, the computer equipment may first determine the data source to which the text to be processed belongs (i.e. the target data source), and then determine the filtering dictionary corresponding to that target data source (i.e. the goal filtering dictionary).

In the present embodiment, the computer equipment may perform word segmentation processing on the text to be processed to obtain the first participles of the text, then reject from the first participles those that are identical to a filter word recorded in the goal filtering dictionary, and then obtain the keywords of the text to be processed according to the first participles remaining after the rejection (i.e. the second participles).
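The rejection step can be sketched as follows. This is a rough illustration under stated assumptions: `FILTER_DICTIONARIES`, `segment`, and `second_participles` are hypothetical names, the dictionary contents are invented sample data, and the whitespace segmenter merely stands in for a real word segmentation tool.

```python
# Hypothetical filtering dictionaries, one per data source (sample data).
FILTER_DICTIONARIES = {
    "sports news": {"the", "a", "was", "match"},
}


def segment(text):
    """Stand-in segmenter that splits on whitespace. A real system would
    call a word segmentation tool here instead."""
    return text.lower().split()


def second_participles(text, data_source):
    """Segment the text, then reject the first participles that appear in
    the goal filtering dictionary of the text's data source."""
    goal_filter = FILTER_DICTIONARIES.get(data_source, set())
    first = segment(text)
    return [word for word in first if word not in goal_filter]
```

For instance, `second_participles("The match was a thrilling final", "sports news")` keeps only the words that are not filter words of the "sports news" source.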

In one embodiment, constructing the goal filtering dictionary may include the following steps: performing word segmentation processing on the texts belonging to the target data source to obtain multiple third participles; determining the first ratio corresponding to each third participle; and constructing the goal filtering dictionary according to the third participles whose first ratio exceeds the first proportion threshold value.

The first ratio corresponding to a third participle may be the ratio of the number of texts in the target data source that contain the third participle to the total number of texts of the target data source. Suppose the target data source "sports news" covers 10 texts, 6 of which contain the third participle "match"; then the first ratio corresponding to the third participle "match" is 6/10 = 60%.

The first proportion threshold value is the criterion for measuring whether a third participle is a word common to the texts covered by the target data source. If the first ratio corresponding to a third participle exceeds the first proportion threshold value, the third participle is a word common to the texts covered by the target data source and should be used as a filter word of the target data source. If the first ratio does not exceed the first proportion threshold value, the third participle is not a common word in those texts and should not be used as a filter word of the target data source. The first proportion threshold value can be determined in any applicable way, for example set manually according to actual needs. Continuing the previous example, if the first proportion threshold value is set to 20%, the first ratio of the third participle "match", 60%, exceeds 20%, so "match" is a word common to the texts covered by the target data source "sports news" and should be used as a filter word of that target data source.

In the present embodiment, the target data source covers several texts, and the computer equipment may perform word segmentation processing on each text covered by the target data source, thereby obtaining several third participles. For each third participle, the computer equipment determines the ratio of the number of texts in the target data source that contain the third participle to the total number of texts of the target data source, to obtain the first ratio corresponding to each third participle. Then, the third participles whose first ratio exceeds the first proportion threshold value are filtered out from the third participles, and the goal filtering dictionary is constructed according to the third participles filtered out. Accordingly, the constructed goal filtering dictionary records each third participle whose first ratio exceeds the first proportion threshold value.
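The construction just described amounts to a document-frequency filter. The sketch below reproduces the "sports news" example (10 texts, 6 containing "match", threshold 20%); the function names and the optional `manual_words` parameter are assumptions made for illustration, and the texts are assumed to be already segmented.

```python
from collections import Counter


def first_ratios(segmented_texts):
    """Map each third participle to the share of texts that contain it."""
    total = len(segmented_texts)
    doc_freq = Counter()
    for words in segmented_texts:
        doc_freq.update(set(words))  # count each word once per text
    return {word: count / total for word, count in doc_freq.items()}


def build_filter_dictionary(segmented_texts, first_threshold, manual_words=()):
    """Collect third participles whose first ratio exceeds the threshold,
    optionally merged with manually chosen words."""
    ratios = first_ratios(segmented_texts)
    common = {word for word, ratio in ratios.items() if ratio > first_threshold}
    return common | set(manual_words)


# Sample data: 10 texts, 6 of which contain "match".
texts = [["match", f"w{i}"] for i in range(6)] + [[f"w{i}"] for i in range(6, 10)]
goal_filter = build_filter_dictionary(texts, first_threshold=0.2)
```

Passing `manual_words` corresponds to the variant in which manually determined words and automatically selected words are recorded together.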

In one embodiment, the words that have no actual meaning for the target data source may also be determined manually on the basis of experience, and the goal filtering dictionary is then constructed according to those words. Accordingly, the constructed goal filtering dictionary records the manually determined words that have no actual meaning for the target data source.

In one embodiment, the goal filtering dictionary may also be constructed jointly from the manually determined words that have no actual meaning for the target data source and the third participles whose first ratio exceeds the first proportion threshold value. That is, the constructed goal filtering dictionary records both the manually determined words that have no actual meaning for the target data source and the third participles whose first ratio exceeds the first proportion threshold value.

It should be noted that constructing the filtering dictionary corresponding to a data source may be preparatory work completed in advance. Specifically, the filtering dictionary corresponding to each data source may be constructed in advance. After the text to be processed is obtained and the data source to which it belongs (i.e. the target data source) is determined, when the filtering dictionary corresponding to the target data source (i.e. the goal filtering dictionary) is needed, it is looked up directly among the filtering dictionaries constructed in advance, without having to construct the goal filtering dictionary on the spot. In addition, the filtering dictionary corresponding to each data source may be updated regularly.

In one embodiment, besides the manual setting described above, the first proportion threshold value may also be determined by the following steps, as shown in Fig. 6: S602, determining the fourth participles from the third participles according to the current proportion threshold value; S604, determining the remaining word number corresponding to each text belonging to the target data source; S606, determining the second ratio, namely the ratio of the number of texts whose remaining word number is equal to or greater than the word number threshold value to the total number of texts of the target data source; S608, when the second ratio does not exceed the second proportion threshold value, determining the current proportion threshold value as the first proportion threshold value; S610, when the second ratio exceeds the second proportion threshold value, updating the current proportion threshold value according to the downward adjustment value, and returning to the step of determining the fourth participles from the third participles according to the current proportion threshold value.

The fourth participles may include the third participles whose first ratio is equal to or greater than the current proportion threshold value. Specifically, among the third participles obtained by performing word segmentation processing on the texts belonging to the target data source, those whose first ratio is equal to or greater than the current proportion threshold value are the fourth participles.

The remaining word number corresponding to a text may be the number of third participles remaining after the fourth participles are rejected from the third participles of that text.

For example, as shown in Fig. 7, suppose the texts belonging to the target data source are texts CT1, CT2 and CT3. Word segmentation processing on CT1 yields the third participles Pc3-1, Pc3-2, Pc3-3, Pc3-4 and Pc3-5; word segmentation processing on CT2 yields the third participles Pc3-1, Pc3-2, Pc3-6, Pc3-7 and Pc3-8; and word segmentation processing on CT3 yields the third participles Pc3-6, Pc3-7, Pc3-8, Pc3-9 and Pc3-10. The third participles obtained from CT1, CT2 and CT3 are therefore Pc3-1 through Pc3-10, 10 third participles in total.

Suppose that, among the third participles Pc3-1 through Pc3-10, those whose first ratio is equal to or greater than the current proportion threshold value N1 (i.e. the fourth participles) are Pc3-1, Pc3-2, Pc3-3, Pc3-4 and Pc3-9. Then, among the third participles of text CT1 (Pc3-1, Pc3-2, Pc3-3, Pc3-4 and Pc3-5), the third participle remaining after the fourth participles are rejected is Pc3-5; that is, for the current proportion threshold value N1, the remaining word number corresponding to text CT1 is 1. Among the third participles of text CT2 (Pc3-1, Pc3-2, Pc3-6, Pc3-7 and Pc3-8), the third participles remaining after the rejection are Pc3-6, Pc3-7 and Pc3-8; that is, for the current proportion threshold value N1, the remaining word number corresponding to text CT2 is 3. Among the third participles of text CT3 (Pc3-6, Pc3-7, Pc3-8, Pc3-9 and Pc3-10), the third participles remaining after the rejection are Pc3-6, Pc3-7, Pc3-8 and Pc3-10; that is, for the current proportion threshold value N1, the remaining word number corresponding to text CT3 is 4.

The second ratio is the ratio of the number of texts whose remaining word number is equal to or greater than the word number threshold value to the total number of texts of the target data source. The word number threshold value can be determined according to actual needs. Continuing the previous example, the texts belonging to the target data source are CT1, CT2 and CT3, 3 texts in total. Assuming the word number threshold value is 3, the texts whose remaining word number is equal to or greater than 3 are CT2 and CT3, so the second ratio is 2/3 ≈ 66.7%.

The second proportion threshold value is the criterion for measuring whether the current proportion threshold value can serve as the first proportion threshold value. After the second ratio is determined, it is judged whether the second ratio exceeds the second proportion threshold value. If it does not, the current proportion threshold value is determined as the first proportion threshold value, and the process of determining the first proportion threshold value ends. If it does, the current proportion threshold value cannot serve as the first proportion threshold value; the current proportion threshold value is then updated according to the downward adjustment value, i.e. the downward adjustment value is subtracted from the current proportion threshold value, and the step of determining the fourth participles from the third participles and its subsequent steps are executed again with the updated current proportion threshold value. The second proportion threshold value can be determined according to actual needs; for example, it can be set to 90%.

Furthermore, each data source may correspond to one first proportion threshold value. In the process of determining the first proportion threshold value corresponding to a data source, when the current proportion threshold value is determined for the first time, the initial proportion threshold value is used as the current proportion threshold value. The initial proportion threshold value can be predetermined according to actual needs; for example, it can be set to 100%.

It should be noted that, in contrast to setting the first proportion threshold value manually, this approach determines, according to the current proportion threshold value, the second ratio of the texts whose remaining word number is equal to or greater than the word number threshold value among the texts belonging to the target data source. When the second ratio exceeds the second proportion threshold value, the current proportion threshold value is lowered and the second ratio is determined again, until the second ratio no longer exceeds the second proportion threshold value, at which point the current proportion threshold value is determined as the first proportion threshold value. In this way, the first proportion threshold value is determined automatically by the computer equipment, and the accuracy of the determined first proportion threshold value is improved.
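The loop of steps S602 to S610 can be sketched as below. Several details are assumptions made for the illustration rather than statements of the described method: thresholds are expressed as integer percentages to avoid floating-point drift, the downward adjustment value defaults to 10 percentage points, the sample texts are invented, and the function returns `None` if the threshold would reach zero, a case the text does not address.

```python
from collections import Counter


def determine_first_threshold(segmented_texts, word_number_threshold,
                              second_threshold, initial=100, step=10):
    """Lower the current threshold until the second ratio no longer
    exceeds the second proportion threshold. Thresholds are percentages."""
    total = len(segmented_texts)
    doc_freq = Counter()
    for words in segmented_texts:
        doc_freq.update(set(words))
    current = initial
    while current > 0:
        # S602: fourth participles have a first ratio >= the current threshold.
        fourth = {w for w, c in doc_freq.items() if c * 100 >= current * total}
        # S604: remaining word number of each text after rejecting them.
        remaining = [sum(1 for w in words if w not in fourth)
                     for words in segmented_texts]
        # S606: count of texts whose remaining word number reaches the threshold.
        enough = sum(1 for n in remaining if n >= word_number_threshold)
        # S608: accept when the second ratio does not exceed its threshold.
        if enough * 100 <= second_threshold * total:
            return current
        # S610: otherwise subtract the downward adjustment value and retry.
        current -= step
    return None  # assumption: no suitable threshold was found


sample = [["a", "b", "x"], ["a", "b", "y"], ["a", "c", "z"], ["a", "c", "w"]]
result = determine_first_threshold(sample, word_number_threshold=2,
                                   second_threshold=50)
```

With this sample, thresholds from 100 down to 60 reject only "a", leaving two words in every text, so the second ratio stays at 100%; at 50 the words "b" and "c" are rejected as well and the loop stops.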

In one embodiment, the step of obtaining the keywords of the text to be processed according to the second participles may include the following steps: performing permutation and combination on the second participles to obtain the fifth participles; determining the sixth participles from the fifth participles; determining the seventh participles from the sixth participles; and obtaining the keywords of the text to be processed according to the seventh participles.

A fifth participle contains at least two consecutively adjacent second participles. Specifically, after the first participles belonging to the goal filtering dictionary are rejected from the first participles, the remaining first participles (i.e. the second participles) are combined according to a preset permutation and combination rule to obtain all combined words that contain at least two consecutively adjacent second participles; each combined word is a fifth participle.

For example, suppose the first participles remaining after the first participles belonging to the goal filtering dictionary are rejected are "China", "people" and "liberation army". After permutation and combination according to the preset rule, a total of 3 combined words containing at least two consecutively adjacent second participles are obtained: "Chinese people", "People's Liberation Army" and "Chinese People's Liberation Army". These 3 combined words are 3 fifth participles. It should be noted that, because "China" and "liberation army" are not consecutively adjacent, "Chinese liberation army" is not a fifth participle.

A sixth participle may be a fifth participle that is an existing entry. Existing entries may include encyclopedia entries, i.e. entries that can be found through an encyclopedia search service. For example, encyclopedia entries may include the entries included in Baidupedia and the entries included in Wikipedia. Specifically, the computer equipment may judge whether each fifth participle is an existing entry, and then take each fifth participle that is an existing entry as a sixth participle.

A seventh participle is a sixth participle that is not contained in any other sixth participle. Here, if participle B contains the entire content of participle A, then participle A is contained in participle B; if participle B contains only part of the content of participle A, and not its entire content, then participle A is not contained in participle B (participle A and participle B are any two mutually different participles, and the labels "A" and "B" only serve to distinguish them by name).

For example, suppose the sixth participles are "Chinese people", "People's Liberation Army" and "Chinese People's Liberation Army". Among these 3 sixth participles, "Chinese People's Liberation Army" contains the entire content of "Chinese people", so "Chinese people" is contained in "Chinese People's Liberation Army" and is not a seventh participle. Likewise, "Chinese People's Liberation Army" contains the entire content of "People's Liberation Army", so "People's Liberation Army" is contained in "Chinese People's Liberation Army" and is not a seventh participle. Only "Chinese People's Liberation Army" is not contained in any other sixth participle (it is contained neither in "Chinese people" nor in "People's Liberation Army"), so the seventh participle determined from these 3 sixth participles is "Chinese People's Liberation Army".
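The chain from second participles to seventh participles can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, `known_entries` is a plain set standing in for an encyclopedia lookup, and the tokens are literal word-by-word glosses joined with spaces so that containment can be checked as a substring test. A production version would compare token sequences instead, since a raw substring test can match across word boundaries.

```python
def fifth_participles(second, joiner=" "):
    """All combined words made of at least two consecutively adjacent
    second participles."""
    n = len(second)
    return [joiner.join(second[i:j])
            for i in range(n) for j in range(i + 2, n + 1)]


def seventh_participles(second, known_entries, joiner=" "):
    """Keep combined words that are existing entries (sixth participles)
    and are not contained in any other sixth participle."""
    sixth = [w for w in fifth_participles(second, joiner) if w in known_entries]
    return [w for w in sixth
            if not any(w != other and w in other for other in sixth)]


tokens = ["China", "people", "liberation army"]
entries = {"China people", "people liberation army",
           "China people liberation army"}
combos = fifth_participles(tokens)
keywords = seventh_participles(tokens, entries)
```

Note that "China liberation army" never appears among the combinations, because "China" and "liberation army" are not adjacent.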

In one embodiment, obtaining the keywords of the text to be processed according to the seventh participles may specifically be taking the seventh participles as the keywords of the text to be processed.

In one embodiment, the step of obtaining the semantic description information corresponding to each keyword, i.e. step S204, may include the following steps: obtaining the web search message corresponding to each keyword; and obtaining, according to the web search message corresponding to each keyword, the semantic description information corresponding to each keyword.

The web search message corresponding to a keyword can be obtained by searching for the keyword through a network search service. The network search service is a service that performs information search over the internet, and may include at least one of a web page search service and an encyclopedia search service. Web page search services include, for example, Baidu web search and Google web search. Encyclopedia search services include, for example, Baidupedia search and Wikipedia search, but are not limited to these.

In one embodiment, the web search message may include the target search results obtained by searching for the keyword through the network search service. The target search results may include the several search results with the highest degree of correlation among all the search results obtained (the specific number can be set according to actual needs). For example, the search results obtained by searching for a keyword through a network search service are generally sorted by degree of correlation from high to low, so that the degree of correlation decreases from front to back; accordingly, the first 50 search results may be taken as the target search results.

In one embodiment, obtaining the semantic description information corresponding to a keyword according to the web search message corresponding to the keyword may specifically mean that the semantic description information corresponding to the keyword includes the web search message corresponding to the keyword.

In another embodiment, in combination with the description above, the semantic description information corresponding to a keyword may also be determined jointly from the expert description information corresponding to the keyword and the web search message corresponding to the keyword. Specifically, the semantic description information corresponding to the keyword may include both the web search message corresponding to the keyword and the expert description information corresponding to the keyword.

In one embodiment, the step of obtaining the web search message corresponding to each keyword may include the following step: calling the network search service to search for each keyword, to obtain the web search message corresponding to each keyword.

In the present embodiment, each time the web search message corresponding to a keyword is needed, the network search service is called on the spot to search for the keyword, so as to obtain the web search message corresponding to the keyword.

In one embodiment, the step of obtaining the web search message corresponding to each keyword may include the following steps: searching a local information library for the candidate keyword corresponding to each keyword, where the local information library records the matching relationships between candidate keywords and candidate search information, and the candidate search information is obtained by searching for the corresponding candidate keyword through the network search service; when a candidate keyword corresponding to a keyword is found, obtaining the web search message corresponding to the keyword according to the candidate search information matched with the candidate keyword found; and when no candidate keyword corresponding to a keyword is found, calling the network search service to search for the keyword, to obtain the web search message corresponding to the keyword.

In the present embodiment, the network search service may be called in advance to search for each candidate keyword, and the target search results found for each candidate keyword are taken as the candidate search information. A database is then generated that records the candidate keywords, the candidate search information, and the matching relationships between the candidate keywords and the candidate search information, and the content of the database is stored on the computer equipment to obtain the local information library.

When the web search message corresponding to a keyword is subsequently needed, the computer equipment can look up the candidate keyword corresponding to the keyword directly in the local information library. If a candidate keyword corresponding to the keyword is found in the local information library, the network search service has already been called to search for the keyword. In that case the network search service need not be called again; the candidate search information matched with the candidate keyword corresponding to the keyword is used directly as the web search message corresponding to the keyword.

Conversely, if no candidate keyword corresponding to the keyword is found in the local information library, the network search service has not been called in advance to search for the keyword, and the local information library therefore stores no candidate search information that can serve as the web search message corresponding to the keyword. In that case, the computer equipment can call the network search service on the spot to search for the keyword, and the target search results found for the keyword are the web search message corresponding to the keyword. In addition, the keyword and the target search results found for it can be taken as a newly added candidate keyword and newly added candidate search information, and used to update the local information library and the database recording the candidate keywords, the candidate search information, and the matching relationships between them.

It should be noted that calling the network search service for a keyword only when no corresponding candidate keyword is found in the local information library can greatly improve the efficiency of determining the classification to which the text to be processed belongs. In practice, the number of keywords that frequently occur in texts is fairly limited. Once candidate keywords on the order of tens of millions have accumulated in the local information library, the external network search service rarely needs to be called to obtain the web search message corresponding to a keyword, so the task of determining the classifications of massive numbers of texts can be completed extremely efficiently.

In addition, the database recording the candidate keywords, the candidate search information, and the matching relationships between them may be updated on a schedule. For example, every predetermined number of days, the network search service is called again to search for each candidate keyword in the database, so as to update the candidate search information corresponding to each candidate keyword.
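The lookup-then-fallback behaviour of the local information library can be sketched as a small cache in front of a search callable. This is a minimal sketch, assuming an in-memory dictionary in place of a real database; the class name `LocalInfoLibrary`, its methods, and the `remote_calls` counter are all hypothetical, and `search` stands for whatever network search service is actually used.

```python
class LocalInfoLibrary:
    """Maps candidate keywords to candidate search information and falls
    back to a network search service only on a miss."""

    def __init__(self, search):
        self._search = search        # callable: keyword -> search results
        self._records = {}           # candidate keyword -> search info
        self.remote_calls = 0        # counts calls to the search service

    def web_search_message(self, keyword):
        # Hit: reuse the stored candidate search information.
        if keyword in self._records:
            return self._records[keyword]
        # Miss: call the search service and record the new candidate.
        return self._fetch(keyword)

    def refresh_all(self):
        """Scheduled update: search again for every recorded keyword."""
        for keyword in list(self._records):
            self._fetch(keyword)

    def _fetch(self, keyword):
        self.remote_calls += 1
        self._records[keyword] = self._search(keyword)
        return self._records[keyword]


library = LocalInfoLibrary(lambda kw: [f"result for {kw}"])
first = library.web_search_message("moba")
second = library.web_search_message("moba")
```

The second lookup of the same keyword is served from the stored record, so the search callable is invoked only once until `refresh_all` runs.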

In one embodiment, the step of obtaining the semantic description information corresponding to each keyword according to the web search message corresponding to each keyword may include the following step: performing data cleansing on the web search message corresponding to each keyword, to obtain the semantic description information corresponding to each keyword.

Data cleansing can remove, from the web search message corresponding to a keyword, the information that is unrelated to the keyword itself. Accordingly, the semantic description information corresponding to the keyword may include the information remaining after the unrelated information is removed. Information unrelated to the keyword itself may include dates, website names, video playback information, music information, common web addresses and the like, but is not limited to these.

For example, if the web search message of a keyword is "2018, Douban 8.5 points, comparable to 'Teenage Pi's Fantastic Drift'! _Sohu Entertainment_sohu.com [play online]", data cleansing removes the following information: "2018", "_Sohu Entertainment", "_sohu.com" and "[play online]".
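One possible way to approximate this cleansing is a handful of regular expressions, as sketched below. The patterns are illustrative assumptions only: they target the kinds of noise named above (years, web addresses, bracketed playback labels, trailing site-name segments introduced by underscores) and are tuned to a sample string modelled on the example, not taken from the described method.

```python
import re

# Hypothetical noise patterns, applied in order.
NOISE_PATTERNS = [
    re.compile(r"https?://\S+"),          # common web addresses
    re.compile(r"\b(?:19|20)\d{2}\b"),    # four-digit years
    re.compile(r"\[[^\]]*\]"),            # bracketed labels such as playback info
    re.compile(r"(?:_[^_]+)+$"),          # trailing "_site name" segments
]


def clean_search_message(message):
    """Remove noise that is unrelated to the keyword itself."""
    for pattern in NOISE_PATTERNS:
        message = pattern.sub("", message)
    # Collapse whitespace and trim leftover separators at both ends.
    return re.sub(r"\s+", " ", message).strip(" ,")


raw = ('2018, Douban 8.5 points, comparable to "Teenage Pi\'s Fantastic Drift"! '
       '_Sohu Entertainment_sohu.com [play online]')
cleaned = clean_search_message(raw)
```

Real search snippets are far noisier than this sample, so a practical cleaner would likely need source-specific rules in addition to generic patterns like these.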

In one embodiment, the step of determining the first degree of correlation of each keyword with each candidate classification according to each piece of semantic description information, i.e. step S206, may include the following steps: determining, according to each piece of semantic description information and the category name of each candidate classification, the third degree of correlation of each piece of semantic description information with each candidate classification; and determining, according to each third degree of correlation, the first degree of correlation of each keyword with each candidate classification.

The category name of a candidate classification is the name of that candidate classification. A category name may contain only a single level, such as "mobile phone app". A category name may also contain more than one level, in which case a predetermined connector can be used to separate the levels. For example, "mobile phone app-game-moba" is a category name containing 3 levels, with the connector "-" separating the levels.

The third degree of correlation of a piece of semantic description information with a candidate classification is determined according to the semantic description information and the category name of the candidate classification, and can serve as a metric of the degree of matching between the semantic description information and the candidate classification. The value range of the third degree of correlation may be [0, +1]. The larger the third degree of correlation, the higher the degree of matching between the semantic description information and the candidate classification, as judged from the semantic description information and the category name of the candidate classification; conversely, the smaller the third degree of correlation, the lower the degree of matching between the semantic description information and the candidate classification.

As described above, for each keyword, the computer equipment can determine the first degree of correlation of the keyword with each candidate classification according to the semantic description information corresponding to the keyword and the classification description information of each candidate classification. In the present embodiment, the classification description information of a candidate classification may include the category name of the candidate classification. Accordingly, for each keyword, the computer equipment determines the third degree of correlation of the semantic description information corresponding to the keyword with each candidate classification according to that semantic description information and the category name of each candidate classification, and then determines the first degree of correlation of the keyword with each candidate classification according to those third degrees of correlation.

For example, as shown in Fig. 8, the keywords of the text to be processed LT1 are keyword Kw1, keyword Kw2 and keyword Kw3. Keyword Kw1 corresponds to semantic description information Sd1, keyword Kw2 corresponds to semantic description information Sd2, and keyword Kw3 corresponds to semantic description information Sd3. The candidate classifications are candidate classification C1, candidate classification C2 and candidate classification C3.

Accordingly, computer equipment determines semantic description according to the category name of semantic description information Sd1 and candidate classification C1 The third degree of correlation of information Sd1 and candidate classification C1, thus related to the third of candidate classification C1 according to semantic description information Sd1 Degree determines first degree of correlation of keyword Kw1 and candidate classification C1.

Computer equipment determines semantic description information according to the category name of semantic description information Sd1 and candidate classification C2 The third degree of correlation of Sd1 and candidate classification C2, thus according to the third degree of correlation of semantic description information Sd1 and candidate classification C2, Determine first degree of correlation of keyword Kw1 and candidate classification C2.

Computer equipment determines semantic description information according to the category name of semantic description information Sd1 and candidate classification C3 The third degree of correlation of Sd1 and candidate classification C3, thus according to the third degree of correlation of semantic description information Sd1 and candidate classification C3, Determine first degree of correlation of keyword Kw1 and candidate classification C3.

And so on, determine first of keyword Kw2 respectively with candidate classification C1, candidate classification C2 and candidate classification C3 The degree of correlation determines first degree of correlation of the keyword Kw3 respectively with candidate classification C1, candidate classification C2 and candidate classification C3.

In one embodiment, the third degree of correlation between the semantic description information corresponding to a keyword and a candidate classification is itself taken as the first degree of correlation between that keyword and that candidate classification. For example, the third degree of correlation between semantic description information Sd1 and candidate classification C1 is the first degree of correlation between keyword Kw1 and candidate classification C1.

In one embodiment, the step of determining the third degree of correlation between each semantic description information and each candidate classification according to each semantic description information and the category name of each candidate classification may include the following steps: determining, according to each semantic description information and the category name of each candidate classification, the shared words of each semantic description information and the category name of each candidate classification; determining, from these shared words, the target shared words of each semantic description information and the category name of each candidate classification; determining the third ratio, that is, the ratio of the total word length of the target shared words of each semantic description information and the category name of each candidate classification to the total word length of the category name of that candidate classification; and determining the third degree of correlation between each semantic description information and the category name of each candidate classification according to each third ratio, the first word frequency of each target shared word in the corresponding semantic description information, and the first inverse document frequency of each target shared word.

The shared word of semantic description information and the category name of candidate classification, is the semantic description information and candidate's classification Category name in the participle that jointly comprises.For example, semantic description information is that " " king's honor " is by Tencent's development of games and to transport Moba class mobile phone games of the capable a operation on android, ios platform ", the category name of candidate classification is " mobile phone App- game-moba ", the shared word of the two is " mobile phone ", " hand ", " machine ", " game ", " trip ", " play ", " moba ", " m ", " o ", " b ", " a ", " mo ", " ob ", " ba " etc..

In this embodiment, for each semantic description information, the computer equipment determines the shared words of that semantic description information and the category name of each candidate classification. For example, suppose there are three pieces of semantic description information (Sd1, Sd2 and Sd3) and three candidate classifications (C1, C2 and C3). The computer equipment then determines the shared words of semantic description information Sd1 and candidate classification C1, the shared words of Sd1 and candidate classification C2, and the shared words of Sd1 and candidate classification C3. Similarly, it determines the shared words of semantic description information Sd2 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively, and the shared words of semantic description information Sd3 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively.

A target shared word of semantic description information and the category name of a candidate classification is a shared word that is not contained in any other shared word of that semantic description information and that category name. Similar to the definition of the seventh word segment above, if shared word D contains the full content of shared word C, shared word C is contained in shared word D; if shared word D contains only part of the content of shared word C, and not its full content, shared word C is not contained in shared word D (shared word C and shared word D are any two distinct shared words, and the labels "C" and "D" serve only to distinguish them).

For example, suppose the shared words of semantic description information and the category name of a candidate classification are "mobile phone" (手机), "hand" (手) and "machine" (机). Since "mobile phone" contains the full content of "hand", "hand" is contained in "mobile phone" and is not a target shared word. Since "mobile phone" also contains the full content of "machine", "machine" is likewise contained in "mobile phone" and is not a target shared word. Only "mobile phone" is not contained in any shared word other than itself (it is contained neither in "hand" nor in "machine", each of which covers only part of it), so the target shared word determined from these three shared words is "mobile phone".
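The selection of target shared words described above amounts to keeping only the shared words that are not substrings of any other shared word. A minimal Python sketch, assuming shared words are plain strings and containment means substring containment (the function name `target_shared_words` is hypothetical):

```python
def target_shared_words(shared):
    """Keep the shared words that are not contained in any other shared word."""
    return {
        word
        for word in shared
        if not any(word != other and word in other for other in shared)
    }


# "手机" = mobile phone, "手" = hand, "机" = machine (reconstructed example strings)
small = target_shared_words({"手机", "手", "机"})
full = target_shared_words(
    {"手机", "手", "机", "游戏", "游", "戏", "moba", "m", "o", "b", "a", "mo", "ob", "ba"}
)
print(small, full)
```

For the three-word example only "手机" survives; for the larger set the survivors are "手机", "游戏" and "moba".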

In this embodiment, for each semantic description information, the computer equipment determines the target shared words of that semantic description information and the category name of each candidate classification from the corresponding shared words.

Continuing the foregoing example, the computer equipment determines the target shared words of semantic description information Sd1 and candidate classification C1 from the shared words of Sd1 and C1, determines the target shared words of Sd1 and candidate classification C2 from the shared words of Sd1 and C2, and determines the target shared words of Sd1 and candidate classification C3 from the shared words of Sd1 and C3.

From the shared words of semantic description information Sd2 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively, the computer equipment determines the target shared words of Sd2 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively.

From the shared words of semantic description information Sd3 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively, the computer equipment determines the target shared words of Sd3 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively.

Third ratio is that total word length of the shared word of target of the category name of semantic description information and candidate classification accounts for the time The ratio for selecting total word of the category name of classification long.For example, the target designation of candidate classification is " mobile phone app- game-moba ", Assuming that it is " mobile phone ", " game " and " moba " that the target of semantic description information and the category name of candidate classification, which shares word, then should Total word length of the shared word of the target of the category name of semantic description information and the candidate classification is that 8 (" mobile phone " is 2, and " game " is 2, " moba " is 4, is added up to 8), and total word length of the category name of candidate's classification is 11 (due to being long, 3 "-" that calculate total word In connector is not counted in, the total length of " mobile phone app game moba " is that 11), therefore third ratio is

In this embodiment, for each semantic description information, the computer equipment determines the third ratio of the total word length of the target shared words of that semantic description information and the category name of each candidate classification to the total word length of the category name of that candidate classification.

Continuing the foregoing example, the computer equipment determines the third ratio of the total word length of the target shared words of semantic description information Sd1 and candidate classification C1 to the total word length of the category name of candidate classification C1, the third ratio of the total word length of the target shared words of Sd1 and candidate classification C2 to the total word length of the category name of candidate classification C2, and the third ratio of the total word length of the target shared words of Sd1 and candidate classification C3 to the total word length of the category name of candidate classification C3.

The computer equipment determines the third ratios of the total word lengths of the target shared words of semantic description information Sd2 with candidate classification C1, candidate classification C2 and candidate classification C3 to the total word lengths of the category names of candidate classification C1, candidate classification C2 and candidate classification C3 respectively.

The computer equipment determines the third ratios of the total word lengths of the target shared words of semantic description information Sd3 with candidate classification C1, candidate classification C2 and candidate classification C3 to the total word lengths of the category names of candidate classification C1, candidate classification C2 and candidate classification C3 respectively.

The first word frequency of a target shared word of semantic description information and the category name of a candidate classification is the number of times that target shared word occurs in the semantic description information. For example, suppose the semantic description information is "Honor of Kings is a moba-type mobile phone game developed and run by Tencent Games and operated on the android and ios platforms", and the target shared words of the semantic description information and the category name of the candidate classification are "mobile phone", "game" and "moba". The first word frequency of each of these three target shared words in the semantic description information is then 1.
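Counting the first word frequency can be sketched in Python as follows. The sketch assumes that an occurrence is a non-overlapping substring match, which the text does not specify, and it uses a simplified English stand-in for the description; `first_word_frequency` is a hypothetical name.

```python
def first_word_frequency(target_words, description):
    """Map each target shared word to its number of occurrences in the description."""
    # str.count returns the number of non-overlapping occurrences of the substring.
    return {word: description.count(word) for word in target_words}


# Simplified stand-in for the semantic description information (assumption).
description = "honor of kings is a moba mobile game developed by tencent for android and ios"
freqs = first_word_frequency(["mobile", "game", "moba"], description)
print(freqs)
```

A tokenizer-based count would be an equally reasonable reading and could give different numbers for words that occur inside longer words.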

Similar to the definition of the inverse document frequency of a keyword above, the first inverse document frequency of a target shared word of semantic description information and the category name of a candidate classification may be determined from the number of all objects in a target corpus and the number of those objects that contain the target shared word.

In one embodiment, the target corpus may be the corpus corresponding to a web search service. Accordingly, the number of objects in the target corpus that contain the target shared word may be the total number of search results obtained by calling the web search service to search for the target shared word. The number of all objects in the target corpus may be set to a predetermined value, for example 1+100000000.

In this embodiment, for each semantic description information, the computer equipment determines the third degree of correlation between that semantic description information and the category name of each candidate classification according to: the third ratio of the total word length of the target shared words of the semantic description information and the category name of each candidate classification to the total word length of that category name; the first word frequency, in the semantic description information, of each target shared word of the semantic description information and the category name of each candidate classification; and the first inverse document frequency of each such target shared word.

In one embodiment, for any semantic description information and any candidate classification, the third degree of correlation between the semantic description information and the candidate classification may be expressed in terms of Rb, $TF1_i$ and $IDF1_i$ ($i = 1, \dots, N$), where N denotes the total number of target shared words of the semantic description information and the category name of the candidate classification, N being an integer equal to or greater than 1; Rb denotes the third ratio of the total word length of the N target shared words to the total word length of the category name of the candidate classification; $TF1_i$ denotes the first word frequency of the i-th of the N target shared words in the semantic description information; and $IDF1_i$ denotes the first inverse document frequency of the i-th of the N target shared words.

In another embodiment, for any semantic description information and any candidate classification, the third degree of correlation between the semantic description information and the candidate classification may additionally depend on $L_i$, where $L_i$ denotes the word length of the i-th of the N target shared words.

It should be noted that when the computed value of the third degree of correlation between semantic description information and a candidate classification is greater than 1, it may be set to 1.

In one embodiment, after the shared words of semantic description information and a category name are determined, the shared words may be stored as a set, forming a shared word set; after the target shared words of the semantic description information and the category name are determined, the target shared words may likewise be stored as a set, forming a target shared word set.

In one embodiment, for semantic description information and category names that contain uppercase English characters, the uppercase characters may be converted to lowercase before the shared words of the semantic description information and the category name are determined, so as to unify the data format.

In one embodiment, the method for determining text classification may further include the following step: determining the fourth degree of correlation between each semantic description information and each candidate classification according to each semantic description information, the predetermined classification association words of each candidate classification, and the predetermined correlation coefficient between each predetermined classification association word and its corresponding candidate classification. Accordingly, the step of determining the first degree of correlation between each keyword and each candidate classification according to each third degree of correlation may include the following step: determining the first degree of correlation between each keyword and each candidate classification according to each third degree of correlation and each fourth degree of correlation.

A predetermined classification association word of a candidate classification is a manually determined word that is related to the candidate classification. The correlation coefficient between a predetermined classification association word and its candidate classification characterizes how the two are related. Both the predetermined classification association words of a candidate classification and their correlation coefficients may be determined in advance by people according to experience accumulated in actual business.

The value range of the correlation coefficient is [-1, +1]. When the correlation coefficient between a predetermined classification association word and its candidate classification is positive, the two are positively correlated: the larger the coefficient, the higher the degree of positive correlation, and the smaller the coefficient, the lower the degree of positive correlation. When the correlation coefficient is negative, the two are negatively correlated: the larger the coefficient, the lower the degree of negative correlation, and the smaller the coefficient, the higher the degree of negative correlation.

The fourth degree of correlation between semantic description information and a candidate classification is determined according to the semantic description information, the predetermined classification association words of the candidate classification, and the correlation coefficients between those association words and the candidate classification. It is a metric that may be used to measure the degree of matching between the semantic description information and the candidate classification. The value range of the fourth degree of correlation may be [0, +1]. A larger fourth degree of correlation indicates that, judged by the predetermined classification association words and their correlation coefficients, the semantic description information matches the candidate classification more closely; conversely, a smaller fourth degree of correlation indicates that, judged by the same information, the semantic description information matches the candidate classification less closely.

As described above, for each keyword, the computer equipment may determine the first degree of correlation between the keyword and each candidate classification according to the semantic description information corresponding to the keyword and the classification description information of each candidate classification. In this embodiment, the classification description information of a candidate classification may include the predetermined classification association words of the candidate classification and the predetermined correlation coefficient between each predetermined classification association word and the candidate classification. For each keyword, the computer equipment determines the fourth degree of correlation between the semantic description information corresponding to the keyword and each candidate classification according to that semantic description information, the predetermined classification association words of each candidate classification, and the correlation coefficients between those association words and their respective candidate classifications. The first degree of correlation between the keyword and each candidate classification is then determined according to both the third degree of correlation and the fourth degree of correlation between the corresponding semantic description information and each candidate classification.

For example, the keywords of the text to be processed LT1 are keyword Kw1, keyword Kw2 and keyword Kw3. Keyword Kw1 corresponds to semantic description information Sd1, keyword Kw2 corresponds to semantic description information Sd2, and keyword Kw3 corresponds to semantic description information Sd3. The candidate classifications are candidate classification C1, candidate classification C2 and candidate classification C3.

Accordingly, as shown in Figure 9, the computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C1 according to Sd1 and the category name of C1; determines the fourth degree of correlation between Sd1 and C1 according to Sd1, the predetermined classification association words of C1, and the correlation coefficients between those association words and C1; and then determines the first degree of correlation between keyword Kw1 and candidate classification C1 jointly from the third degree of correlation and the fourth degree of correlation between Sd1 and C1.

The computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C2 according to Sd1 and the category name of C2; determines the fourth degree of correlation between Sd1 and C2 according to Sd1, the predetermined classification association words of C2, and the correlation coefficients between those association words and C2; and then determines the first degree of correlation between keyword Kw1 and candidate classification C2 jointly from the third degree of correlation and the fourth degree of correlation between Sd1 and C2.

The computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C3 according to Sd1 and the category name of C3; determines the fourth degree of correlation between Sd1 and C3 according to Sd1, the predetermined classification association words of C3, and the correlation coefficients between those association words and C3; and then determines the first degree of correlation between keyword Kw1 and candidate classification C3 jointly from the third degree of correlation and the fourth degree of correlation between Sd1 and C3.

By analogy, the computer equipment determines the first degrees of correlation between keyword Kw2 and candidate classification C1, candidate classification C2 and candidate classification C3 respectively, and the first degrees of correlation between keyword Kw3 and candidate classification C1, candidate classification C2 and candidate classification C3 respectively.

Specifically, for any keyword and any candidate classification, the third degree of correlation and the fourth degree of correlation between the semantic description information corresponding to the keyword and the candidate classification may simply be summed to obtain the first degree of correlation between the keyword and the candidate classification. For example, summing the third degree of correlation and the fourth degree of correlation between semantic description information Sd1 and candidate classification C1 gives the first degree of correlation between keyword Kw1 and candidate classification C1.

Alternatively, weights may be set for the third degree of correlation and the fourth degree of correlation respectively, and the first degree of correlation between the keyword and the candidate classification is obtained as the weighted sum of the third degree of correlation and the fourth degree of correlation between the corresponding semantic description information and the candidate classification. For example, the weighted sum of the third degree of correlation between semantic description information Sd1 and candidate classification C1 (with the weight of the third degree of correlation) and the fourth degree of correlation between Sd1 and C1 (with the weight of the fourth degree of correlation) gives the first degree of correlation between keyword Kw1 and candidate classification C1.
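Both ways of combining the third and fourth degrees of correlation can be written as one small Python function in which the plain sum is the special case of unit weights. The function name, parameter names and numeric values below are illustrative assumptions.

```python
def first_relevance(third, fourth, weight_third=1.0, weight_fourth=1.0):
    """Combine the third and fourth degrees of correlation into the first one.

    With the default weights this is the plain sum; other weights give the
    weighted sum. The weight values themselves are not given in the text.
    """
    return weight_third * third + weight_fourth * fourth


plain = first_relevance(0.4, 0.3)                                          # simple sum
weighted = first_relevance(0.4, 0.3, weight_third=0.7, weight_fourth=0.3)  # weighted sum
print(plain, weighted)
```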

In a specific implementation, an association knowledge base may be constructed in advance, containing a number of manually determined items of association knowledge. In this case, the computer equipment may obtain, from the association knowledge base, the predetermined classification association words of each candidate classification and their predetermined correlation coefficients with the corresponding candidate classifications, and then determine the fourth degree of correlation between each semantic description information and each candidate classification according to each semantic description information, the predetermined classification association words of each candidate classification, and those predetermined correlation coefficients.

In one embodiment, the data format of an item of association knowledge may be "classification identifier of the candidate classification (such as a classification ID), category name of the candidate classification, predetermined classification association word of the candidate classification, correlation coefficient between the predetermined classification association word and the candidate classification".

Three example items of association knowledge are shown below:

1, mobile phone app-game-moba, Honor of Kings, 0.8

1, mobile phone app-game-moba, ranking, 0.2

1, mobile phone app-game-moba, League of Legends, -0.9

Here, "Honor of Kings" is one of the games that belong to "mobile phone app-game-moba", so its degree of positive correlation with this candidate classification is very high, and the correlation coefficient of the two may be manually set to 0.8. "Ranking" is only weakly related to "mobile phone app-game-moba", so the correlation coefficient of the two may be manually set to 0.2. "League of Legends" is related to "game-moba" but is not a "mobile phone app"; it is therefore a potential confusable word for the candidate classification "mobile phone app-game-moba", its degree of negative correlation with this candidate classification is very high, and the correlation coefficient of the two may be manually set to -0.9.
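A record in the four-field format above could be read into a typed structure as in the following Python sketch. The class name `AssociationRecord`, the function `parse_record`, the use of a comma as the field separator and the range check are all illustrative assumptions; the text specifies only the order of the fields and the [-1, +1] range of the coefficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationRecord:
    category_id: int
    category_name: str
    association_word: str
    coefficient: float


def parse_record(line):
    """Parse 'id, category name, association word, coefficient' into a record."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    coefficient = float(parts[3])
    # The correlation coefficient is stated to lie in [-1, +1].
    if not -1.0 <= coefficient <= 1.0:
        raise ValueError("correlation coefficient must lie in [-1, +1]")
    return AssociationRecord(int(parts[0]), parts[1], parts[2], coefficient)


records = [
    parse_record(line)
    for line in (
        "1, mobile phone app-game-moba, Honor of Kings, 0.8",
        "1, mobile phone app-game-moba, ranking, 0.2",
        "1, mobile phone app-game-moba, League of Legends, -0.9",
    )
]
print(records[2])
```

This simple split would break if a category name or association word itself contained a comma; a real knowledge base would need a sturdier serialization.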

In other embodiments, the first degree of correlation between each keyword and each candidate classification may also be determined solely from the fourth degree of correlation between each semantic description information and each candidate classification, without considering the third degree of correlation between each semantic description information and each candidate classification.

It should be noted that using the manually determined predetermined classification association words, together with their predetermined correlation coefficients with the corresponding candidate classifications, as a factor in determining the first degree of correlation between a keyword and a candidate classification allows people to intervene in the automatic labeling process. Business personnel can thus improve the quality of classification labeling results according to experience accumulated in actual business scenarios, which achieves industrial-grade human controllability and easy manual optimization.

In addition, in practical application scenarios, the automatic labeling system described above may also provide an association knowledge entry service. The association knowledge entry interface is shown in Figure 10: the user may click control 1001, enter the association knowledge determined by the user in association knowledge input box 1002, and then click control 1003 to complete the manual entry of that association knowledge. The user may also click control 1004 to modify or delete association knowledge that has already been entered.

In one embodiment, the step of determining the fourth degree of correlation between each semantic description information and each candidate classification according to each semantic description information, the predetermined classification association words of each candidate classification, and the predetermined correlation coefficients between those association words and the corresponding candidate classifications may include the following step: determining the fourth degree of correlation between each semantic description information and each candidate classification according to the second word frequency of each predetermined classification association word of each candidate classification in each semantic description information, the second inverse document frequency of each such predetermined classification association word, and the predetermined correlation coefficient between each such predetermined classification association word and its corresponding candidate classification.

The second word frequency of a predetermined classification-associated word of a candidate classification in a piece of semantic description information is the number of times that word occurs in that semantic description information. For example, if the classification-associated word is "Honor of Kings" and the semantic description information is "'Honor of Kings' is a MOBA mobile game developed by Tencent Games and operated on the Android and iOS platforms", the second word frequency of the predetermined classification-associated word in that semantic description information is 1.

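Similar to the inverse document frequency of a keyword described above, the second inverse document frequency of a predetermined classification-associated word of a candidate classification may be determined from the number of all objects in the target corpus and the number of those objects that contain the predetermined classification-associated word.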

In one embodiment, the target corpus may be the corpus corresponding to a network search service. Accordingly, the number of objects in the target corpus that contain the predetermined classification-associated word may be the total number of search results obtained by searching for that word through the network search service, and the number of all objects in the target corpus may be set to a predetermined value, for example 1+100000000.
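The corpus-size convention above can be illustrated with a small sketch. The logarithmic form used here is the common textbook definition of inverse document frequency and is only an assumption; the function name `second_idf`, the `+1` smoothing of the hit count, and the example hit counts are hypothetical. Only the corpus size 1+100000000 comes from the text.

```python
import math

# Number of all objects in the target corpus, set to the predetermined
# value given in the text.
TOTAL_OBJECTS = 1 + 100_000_000


def second_idf(hit_count: int, total: int = TOTAL_OBJECTS) -> float:
    """Assumed IDF form: log(total / (1 + hit_count)).

    hit_count stands for the total number of search results that the
    network search service returns for a classification-associated word.
    """
    if hit_count < 0:
        raise ValueError("hit_count must be non-negative")
    # Clamp at zero so that a hit count above the assumed corpus size
    # never yields a negative value.
    return max(0.0, math.log(total / (1 + hit_count)))


rare = second_idf(1_000)          # hypothetical hit count for a rare word
common = second_idf(50_000_000)   # hypothetical hit count for a common word
```

Under this assumed form, a word that appears in few search results receives a larger value than a word that appears in many.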

In this embodiment, for each piece of semantic description information, the computer device determines the fourth degree of correlation of that semantic description information with each candidate classification according to the second word frequency of each predetermined classification-associated word of each candidate classification in that semantic description information, the second inverse document frequency of each predetermined classification-associated word of each candidate classification, and the correlation coefficient between each predetermined classification-associated word and its corresponding candidate classification.

In one embodiment, for any piece of semantic description information and any candidate classification, the fourth degree of correlation between the semantic description information and the candidate classification may be expressed in terms of the following quantities: M denotes the total number of predetermined classification-associated words of the candidate classification, M being an integer equal to or greater than 1; TF2_j denotes the second word frequency, in the semantic description information, of the j-th of the M classification-associated words; IDF2_j denotes the second inverse document frequency of the j-th of the M classification-associated words; and Coe_j denotes the correlation coefficient between the j-th of the M classification-associated words and the candidate classification.

In another embodiment, for any piece of semantic description information and any candidate classification, the fourth degree of correlation between the semantic description information and the candidate classification may additionally take into account the word length of the j-th of the M classification-associated words.

It should be noted that when the calculated value of the fourth degree of correlation between the semantic description information and the candidate classification is greater than 1, it may be set to 1.
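The following is a minimal sketch of how the quantities TF2_j, IDF2_j and Coe_j might be combined. It assumes that the fourth degree of correlation is the sum over the M classification-associated words of TF2_j × IDF2_j × Coe_j, capped at 1; that combination is an assumption made for illustration, not a formula confirmed by the text. The function name `fourth_relevance` and all numeric values are hypothetical.

```python
from typing import Dict, Tuple


def fourth_relevance(description: str,
                     associated: Dict[str, Tuple[float, float]]) -> float:
    """Score one piece of semantic description information against one
    candidate classification.

    `associated` maps each classification-associated word to a pair
    (second inverse document frequency, correlation coefficient).
    """
    score = 0.0
    for word, (idf2, coe) in associated.items():
        tf2 = description.count(word)   # second word frequency
        score += tf2 * idf2 * coe       # assumed combination: sum of products
    return min(score, 1.0)              # values greater than 1 are set to 1


description = ("'Honor of Kings' is a MOBA mobile game developed by "
               "Tencent Games")
# Hypothetical IDF values and manually chosen correlation coefficients.
game_words = {"Honor of Kings": (0.5, 0.8), "MOBA": (0.25, 0.4)}
score = fourth_relevance(description, game_words)
```

Keeping the associated words and their coefficients in a plain mapping mirrors the idea that they are entered manually and can be edited without touching the scoring logic.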

The approach described above determines the semantic description information corresponding to each keyword from the web search information of each keyword of the text to be processed, determines the second degree of correlation of the text to be processed with each candidate classification according to at least one of the third degree of correlation of each piece of semantic description information with each candidate classification (determined from the semantic description information and the category name of each candidate classification) and the fourth degree of correlation of each piece of semantic description information with each candidate classification (determined from the semantic description information, the predetermined classification-associated words of each candidate classification, and the predetermined correlation coefficients between those words and the corresponding candidate classifications), and then determines the classification to which the text to be processed belongs according to the second degrees of correlation. This approach rests on a basic assumption: if the keywords extracted from a text are searched through a network search service and the name or the classification-associated words of some candidate classification appear frequently in the search results, the text is closely related to that candidate classification. The assumption can be analyzed as follows. The purpose of a network search service is to provide the content and descriptions most relevant to the search input. For example, the keyword "Husky" is closely related to the candidate classification "Pet-Dog"; when "Husky" is searched through a network search service, the two words "pet" and "dog" appear frequently in the search results. The basic assumption therefore holds in practical applications.

In one embodiment, the method for determining a text classification may further include the following step: obtaining the priority coefficient of each candidate classification. Accordingly, the step of determining, from the candidate classifications, the classification to which the text to be processed belongs according to the second degrees of correlation, i.e. step S210, may include the following steps: determining the fifth degree of correlation of the text to be processed with each candidate classification according to each second degree of correlation and the priority coefficient of each candidate classification; and determining, from the candidate classifications, the classification to which the text to be processed belongs according to the fifth degrees of correlation.

The priority coefficient of a candidate classification may be determined manually and may be used to characterize the degree of priority that relevant personnel assign to that candidate classification in actual business.

In this embodiment, for each candidate classification, the computer device determines the fifth degree of correlation for that candidate classification according to the second degree of correlation between the text to be processed and that candidate classification and the priority coefficient of that candidate classification. Specifically, the second degree of correlation between the text to be processed and the candidate classification may be multiplied by the priority coefficient of the candidate classification to obtain the fifth degree of correlation for that candidate classification.
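The multiplication described above can be sketched as follows. The function names, the default priority coefficient of 1.0 for classifications without a manual entry, and the choice of the single highest-scoring classification are all assumptions made for illustration; the numeric values are hypothetical.

```python
from typing import Dict


def fifth_relevance(second: Dict[str, float],
                    priority: Dict[str, float],
                    default_priority: float = 1.0) -> Dict[str, float]:
    """Multiply each second degree of correlation by the priority
    coefficient of the same candidate classification."""
    return {name: value * priority.get(name, default_priority)
            for name, value in second.items()}


def pick_classification(fifth: Dict[str, float]) -> str:
    """Assumed selection rule: take the classification with the highest
    fifth degree of correlation."""
    return max(fifth, key=fifth.get)


second = {"Pet-Dog": 0.6, "Games": 0.5}   # hypothetical second degrees
priority = {"Games": 1.5}                 # hypothetical manual priority
fifth = fifth_relevance(second, priority)
chosen = pick_classification(fifth)
```

In this example the manually raised priority lets "Games" overtake "Pet-Dog" even though its second degree of correlation is lower.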

In addition, when the classification annotation result corresponding to the text to be processed is output, the degree of correlation in that annotation result between the text to be processed and the classification to which it belongs may be the fifth degree of correlation between the text and that classification.

It should be noted that the second degree of correlation of the text to be processed with each candidate classification is corrected by the priority coefficient of each candidate classification to obtain the fifth degree of correlation of the text to be processed with each candidate classification, and the classification to which the text to be processed belongs is then determined from the candidate classifications according to the fifth degrees of correlation. In this way, the candidate classifications that need to be prioritized can be set flexibly.

In addition, in a practical application scenario, the automatic annotation system described above may also provide a classification priority information entry service. The classification priority information entry interface is shown in Figure 11: the user can click control 1101, enter the classification priority information determined by the user (such as the classification ID, the category name and the priority coefficient) in the classification priority information input box 1102, and then click control 1103 to complete the manual entry of that classification priority information. The user can also click control 1104 to modify or delete classification priority information that has already been entered.

In one embodiment, as shown in Figure 12, a method for determining a text classification is provided. The method may be executed by a computer device and may specifically include the following steps S1202 to S1224.

S1202, extract the keywords of the text to be processed, and determine the weight of each keyword.

S1204, look up the candidate keyword corresponding to each keyword in a local information library; the local information library records the matching relationship between candidate keywords and candidate search information, the candidate search information being obtained by searching for the corresponding candidate keyword through a network search service.

S1206, when a candidate keyword corresponding to a keyword is found, obtain the web search information corresponding to that keyword from the candidate search information matched by the candidate keyword found.

S1208, when no candidate keyword corresponding to a keyword is found, call the network search service to search for that keyword and obtain the web search information corresponding to that keyword.

S1210, perform data cleaning on the web search information corresponding to each keyword to obtain the semantic description information corresponding to each keyword.

S1212, determine the third degree of correlation of each piece of semantic description information with each candidate classification according to each piece of semantic description information and the category name of each candidate classification.

S1214, determine the fourth degree of correlation of each piece of semantic description information with each candidate classification according to each piece of semantic description information, the predetermined classification-associated words of each candidate classification, and the predetermined correlation coefficient between each predetermined classification-associated word and its corresponding candidate classification; the predetermined classification-associated words of a candidate classification, and the correlation coefficients between those words and the candidate classification, are determined manually.

S1216, determine the first degree of correlation of each keyword with each candidate classification according to each third degree of correlation and each fourth degree of correlation.

S1218, determine the second degree of correlation of the text to be processed with each candidate classification according to the weight of each keyword and each first degree of correlation.

S1220, obtain the priority coefficient of each candidate classification, and determine the fifth degree of correlation of the text to be processed with each candidate classification according to each second degree of correlation and the priority coefficient of each candidate classification.

S1222, determine, from the candidate classifications, the classification to which the text to be processed belongs according to each fifth degree of correlation.

S1224, output the classification annotation result corresponding to the text to be processed; the classification annotation result includes the text to be processed, the classification to which the text to be processed belongs, and the fifth degree of correlation between the text to be processed and that classification.
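Steps S1204 to S1208 describe a lookup in the local information library with a fallback to the network search service. A minimal sketch is given below, assuming that the library is an in-memory mapping and that the search service is passed in as a callable; `get_search_info` and `fake_search` are hypothetical names, and writing a newly fetched result back into the library is an added assumption that the steps do not state.

```python
from typing import Callable, Dict, List


def get_search_info(keyword: str,
                    local_library: Dict[str, str],
                    search_service: Callable[[str], str]) -> str:
    """Return the web search information for a keyword, preferring the
    local information library over a live search."""
    cached = local_library.get(keyword)      # S1204: look up candidate keyword
    if cached is not None:
        return cached                        # S1206: reuse stored search info
    result = search_service(keyword)         # S1208: call the search service
    local_library[keyword] = result          # assumption: store for later reuse
    return result


calls: List[str] = []


def fake_search(keyword: str) -> str:
    """Stand-in for a real network search service."""
    calls.append(keyword)
    return f"search results for {keyword}"


library = {"husky": "stored results for husky"}
first = get_search_info("husky", library, fake_search)     # served locally
second = get_search_info("poodle", library, fake_search)   # triggers a search
third = get_search_info("poodle", library, fake_search)    # now served locally
```

Passing the search service as a parameter keeps the lookup logic independent of any particular search provider.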

It should be noted that the specific limitations on each technical feature in this embodiment may be the same as the limitations on the corresponding technical features described above, and are not repeated here.

It should be understood that, under reasonable conditions, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in each flowchart may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

In addition, the performance of an automatic annotation system that applies the method for determining a text classification provided by the embodiments of this application is illustrated below. The automatic annotation system was applied to annotate headlines against a classification system containing 800 candidate classifications organized in 4 levels. The annotation accuracy was 91.5% for first-level classifications, 83.1% for second-level classifications, 78.4% for third-level classifications, and 74.1% for fourth-level classifications. Here, annotation accuracy (Accuracy) = number of texts correctly annotated with the corresponding candidate classification / total number of texts.

In addition, the automatic annotation system does not require any manual effort to annotate training samples during the annotation process. With the traditional approach, by contrast, 8,000,000 training samples would have to be annotated manually (800 candidate classifications with 10,000 samples each), which requires a large amount of manpower and material resources, and the quality of such a large number of manually annotated training samples is difficult to guarantee. In other words, the traditional approach is not practically usable for annotating a complex classification system containing many candidate classifications.

In one embodiment, as shown in Figure 13, an apparatus 1300 for determining a text classification is provided. The apparatus 1300 may include the following modules 1302 to 1310.

A keyword processing module 1302, configured to extract the keywords of the text to be processed and determine the weight of each keyword;

A semantic description information obtaining module 1304, configured to obtain the semantic description information corresponding to each keyword;

A first degree-of-correlation determining module 1306, configured to determine the first degree of correlation of each keyword with each candidate classification according to each piece of semantic description information;

A second degree-of-correlation determining module 1308, configured to determine the second degree of correlation of the text to be processed with each candidate classification according to the weight of each keyword and each first degree of correlation;

A text classification determining module 1310, configured to determine, from the candidate classifications, the classification to which the text to be processed belongs according to each second degree of correlation.

The above apparatus for determining a text classification extracts the keywords of the text to be processed and obtains the weight of each keyword, then obtains the semantic description information corresponding to each keyword, determines the first degree of correlation of each keyword with each candidate classification according to each piece of semantic description information, determines the second degree of correlation of the text to be processed with each candidate classification according to the weight of each keyword and each first degree of correlation, and finally determines, from the candidate classifications, the classification to which the text to be processed belongs according to each second degree of correlation. In this way, the classification to which any text belongs can be determined fully automatically by a computer device without any text whose classification is already known. This removes the step of manually annotating classifications, saves labor costs, and eliminates the dependence of the quality of the determined classification on the quality of manual annotation. In addition, the intermediate logic of the process, from obtaining the text to be processed to determining the classification to which it belongs, can be understood by people, which makes it possible to introduce manually accumulated professional knowledge into the process of determining the classification to which the text to be processed belongs.

In one embodiment, the keyword processing module 1302 may include the following units: a first word segment obtaining unit, configured to perform word segmentation on the text to be processed to obtain multiple first word segments of the text to be processed; a second word segment obtaining unit, configured to remove, from the first word segments, those that belong to a target filtering dictionary, to obtain one or more second word segments, the second word segments being the first word segments that remain after the removal; and a keyword obtaining unit, configured to obtain the keywords of the text to be processed according to the second word segments. The target filtering dictionary includes the filtering dictionary corresponding to a target data source, and the target data source includes the data source to which the text to be processed belongs.

In one embodiment, the apparatus 1300 for determining a text classification may further include the following modules: a third word segment obtaining module, configured to perform word segmentation on each text belonging to the target data source to obtain multiple third word segments; a first ratio determining module, configured to determine the first ratio corresponding to each third word segment, the first ratio corresponding to a third word segment being the ratio of the number of texts in the target data source that contain that third word segment to the total number of texts in the target data source; and a target filtering dictionary constructing module, configured to construct the target filtering dictionary from the third word segments whose first ratio exceeds a first ratio threshold.

In one embodiment, the apparatus 1300 for determining a text classification may further include a first ratio threshold determining module, configured to: determine fourth word segments from the third word segments according to a current ratio threshold, the fourth word segments being the third word segments whose first ratio is equal to or greater than the current ratio threshold; determine the remaining word count corresponding to each text belonging to the target data source, the remaining word count of a text being the number of third word segments of that text that remain after the fourth word segments are removed; determine a second ratio, namely the ratio of the number of texts whose remaining word count is equal to or greater than a word count threshold to the total number of texts belonging to the target data source; when the second ratio is less than a second ratio threshold, determine the current ratio threshold as the first ratio threshold; and when the second ratio exceeds the second ratio threshold, update the current ratio threshold by a lowering value and return to the step of determining the fourth word segments from the third word segments according to the current ratio threshold.
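The iterative threshold search described above can be sketched as follows. The sketch assumes that each text has already been segmented into a list of word segments; the function names, the treatment of the case where the second ratio equals the second ratio threshold (the search stops), and the guard that stops the search before the threshold reaches zero are assumptions, and the sample data is hypothetical.

```python
from typing import Dict, List


def first_ratios(texts: List[List[str]]) -> Dict[str, float]:
    """First ratio of each word segment: the share of texts containing it."""
    counts: Dict[str, int] = {}
    for segments in texts:
        for segment in set(segments):
            counts[segment] = counts.get(segment, 0) + 1
    return {segment: n / len(texts) for segment, n in counts.items()}


def choose_first_ratio_threshold(texts: List[List[str]],
                                 start: float,
                                 lowering_value: float,
                                 word_count_threshold: int,
                                 second_ratio_threshold: float) -> float:
    ratios = first_ratios(texts)
    current = start
    while True:
        # Fourth word segments: first ratio >= current ratio threshold.
        filtered = {s for s, r in ratios.items() if r >= current}
        # Texts that still keep enough word segments after filtering.
        enough = sum(
            1 for segments in texts
            if sum(1 for s in segments if s not in filtered)
            >= word_count_threshold
        )
        second_ratio = enough / len(texts)
        # Stop when the second ratio no longer exceeds its threshold, or
        # when lowering again would make the ratio threshold non-positive.
        if (second_ratio <= second_ratio_threshold
                or current - lowering_value <= 0):
            return current
        current -= lowering_value


texts = [["a", "b", "c"], ["a", "d"], ["a", "b"], ["a", "e", "f"]]
threshold = choose_first_ratio_threshold(
    texts, start=1.0, lowering_value=0.25,
    word_count_threshold=2, second_ratio_threshold=0.4)
filter_dictionary = {s for s, r in first_ratios(texts).items()
                     if r >= threshold}
```

With this sample data the search lowers the threshold twice before stopping, and the resulting filtering dictionary contains the two most widespread word segments.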

In one embodiment, the keyword obtaining unit may include the following subunits: a fifth word segment obtaining subunit, configured to combine the second word segments to obtain fifth word segments, each fifth word segment consisting of at least two consecutive, adjacent second word segments; a sixth word segment obtaining subunit, configured to determine sixth word segments from the fifth word segments, the sixth word segments being the fifth word segments that are existing entries; a seventh word segment obtaining subunit, configured to determine seventh word segments from the sixth word segments, a seventh word segment being a sixth word segment that is not contained in any other sixth word segment; and a keyword obtaining subunit, configured to obtain the keywords of the text to be processed according to the seventh word segments.
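The combination and filtering of word segments described above can be sketched as follows. The set of existing entries is assumed to be available as an in-memory set; the function names, the `joiner` parameter (a space for the English example, an empty string would suit Chinese text), and the sample data are hypothetical.

```python
from typing import List, Set


def fifth_segments(second: List[str], joiner: str = " ") -> List[str]:
    """Join every run of at least two consecutive second word segments."""
    return [joiner.join(second[i:j])
            for i in range(len(second))
            for j in range(i + 2, len(second) + 1)]


def keyword_candidates(second: List[str], entries: Set[str],
                       joiner: str = " ") -> List[str]:
    # Sixth word segments: combinations that are existing entries
    # (duplicates removed, original order kept).
    sixth = list(dict.fromkeys(
        s for s in fifth_segments(second, joiner) if s in entries))
    # Seventh word segments: not contained in any other sixth word segment.
    return [s for s in sixth
            if not any(s != other and s in other for other in sixth)]


second_segments = ["new", "york", "city", "marathon"]
known_entries = {"new york", "new york city", "city marathon"}
candidates = keyword_candidates(second_segments, known_entries)
```

Here "new york" is dropped because it is contained in the longer entry "new york city", so only the most specific entries remain as keyword candidates.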

In one embodiment, the semantic description information obtaining module 1304 may include the following units: a web search information obtaining unit, configured to obtain the web search information corresponding to each keyword, the web search information corresponding to a keyword being obtained by searching for that keyword through a network search service; and a semantic description information obtaining unit, configured to obtain the semantic description information corresponding to each keyword according to the web search information corresponding to that keyword.

In one embodiment, the web search information obtaining unit may include the following subunits: a candidate keyword lookup subunit, configured to look up the candidate keyword corresponding to each keyword in a local information library, where the local information library records the matching relationship between candidate keywords and candidate search information, and the candidate search information is obtained by searching for the corresponding candidate keyword through a network search service; a web search information reading subunit, configured to, when a candidate keyword corresponding to a keyword is found, obtain the web search information corresponding to that keyword from the candidate search information matched by the candidate keyword found; and a web search information searching subunit, configured to, when no candidate keyword corresponding to a keyword is found, call the network search service to search for that keyword and obtain the web search information corresponding to that keyword.

In one embodiment, the apparatus 1300 for determining a text classification may further include the following module: a priority coefficient obtaining module, configured to obtain the priority coefficient of each candidate classification. Accordingly, the text classification determining module 1310 may include the following units: a fifth degree-of-correlation determining unit, configured to determine the fifth degree of correlation of the text to be processed with each candidate classification according to each second degree of correlation and the priority coefficient of each candidate classification; and a text classification determining unit, configured to determine, from the candidate classifications, the classification to which the text to be processed belongs according to each fifth degree of correlation.

In one embodiment, the first degree-of-correlation determining module 1306 may include the following units: a third degree-of-correlation determining unit, configured to determine the third degree of correlation of each piece of semantic description information with each candidate classification according to each piece of semantic description information and the category name of each candidate classification; and a first degree-of-correlation determining unit, configured to determine the first degree of correlation of each keyword with each candidate classification according to each third degree of correlation.

In one embodiment, the third degree-of-correlation determining unit may include the following subunits: a shared word determining subunit, configured to determine, according to each piece of semantic description information and the category name of each candidate classification, the shared words between each piece of semantic description information and the category name of each candidate classification; a target shared word determining subunit, configured to determine, from those shared words, the target shared words between each piece of semantic description information and the category name of each candidate classification; a third ratio determining subunit, configured to determine the third ratio, namely the ratio of the total word length of the target shared words between a piece of semantic description information and the category name of a candidate classification to the total word length of that category name; and a third degree-of-correlation determining subunit, configured to determine the third degree of correlation of each piece of semantic description information with the category name of each candidate classification according to each third ratio, the first word frequency of each target shared word in the corresponding semantic description information, and the first inverse document frequency of each target shared word. Here, a target shared word between a piece of semantic description information and the category name of a candidate classification is a shared word that is not contained in any other shared word between that semantic description information and that category name.

In one embodiment, the apparatus 1300 for determining a text classification may further include the following module: a fourth degree-of-correlation determining module, configured to determine the fourth degree of correlation of each piece of semantic description information with each candidate classification according to each piece of semantic description information, the predetermined classification-associated words of each candidate classification, and the predetermined correlation coefficient between each predetermined classification-associated word and its corresponding candidate classification. Accordingly, the first degree-of-correlation determining unit is configured to determine the first degree of correlation of each keyword with each candidate classification according to each third degree of correlation and each fourth degree of correlation. The predetermined classification-associated words of a candidate classification, and the correlation coefficients between those words and the candidate classification, are determined manually.

In one embodiment, the fourth degree-of-correlation determining module is configured to determine the fourth degree of correlation of each piece of semantic description information with each candidate classification according to the second word frequency of each predetermined classification-associated word of each candidate classification in each piece of semantic description information, the second inverse document frequency of each predetermined classification-associated word of each candidate classification, and the predetermined correlation coefficient between each predetermined classification-associated word and its corresponding candidate classification.

It should be noted that for the specific limitations on the apparatus 1300 for determining a text classification, reference may be made to the limitations on the method for determining a text classification above, which are not repeated here. Each module in the apparatus 1300 may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to execute the steps of the method for determining a text classification provided by any embodiment of this application.

Specifically, the computer device may be the server 120 in Figure 1. As shown in Figure 14, the computer device includes a processor, a memory and a network interface connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for determining a text classification provided by any embodiment of this application.

Alternatively, the computer device may be the terminal 110 in Figure 1. As shown in Figure 15, the computer device includes a processor, a memory, a network interface, an input device and a display screen connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the method for determining a text classification provided by any embodiment of this application. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute the method for determining a text classification provided by any embodiment of this application. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, trackball or touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse.

Those skilled in the art will understand that the structures shown in Figure 14 and Figure 15 are merely block diagrams of the parts of the structure related to the solution of this application and do not limit the computer device to which the solution of this application is applied. A specific computer device may include more or fewer components than shown in the figures, may combine certain components, or may have a different arrangement of components.

In one embodiment, the apparatus 1300 for determining a text classification provided by the embodiments of this application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in Figure 14 or Figure 15. The memory of the computer device may store the program modules that make up the apparatus 1300, for example the keyword processing module 1302, the semantic description information obtaining module 1304 and the first degree-of-correlation determining module 1306 shown in Figure 13. The computer program composed of these program modules causes the processor to execute the steps of the method for determining a text classification of the embodiments of this application described in this specification. For example, the computer device shown in Figure 14 or Figure 15 may execute step S202 through the keyword processing module 1302 of the apparatus 1300 shown in Figure 13, execute step S204 through the semantic description information obtaining module 1304, and so on.

Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

Accordingly, in one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method for determining a text category provided by any embodiment of this application.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the patent scope of this application. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of this application, and these all fall within the scope of protection of this application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A method for determining a text category, comprising:

obtaining semantic description information corresponding to each keyword;

determining, according to the weight of each keyword and each first degree of correlation, a second degree of correlation between the text to be processed and each candidate category;
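
The aggregation step above can be sketched as follows. The claim does not state how the weights and first degrees of correlation are combined, so the weighted sum used here is an assumption, and the function name `second_relevance` and the sample data are hypothetical.

```python
def second_relevance(weights, first_relevance):
    """Combine keyword weights with per-keyword category scores.

    weights: {keyword: weight}
    first_relevance: {keyword: {category: first degree of correlation}}
    Returns {category: second degree of correlation}.
    Assumption: the combination is a weighted sum over keywords.
    """
    totals = {}
    for keyword, weight in weights.items():
        for category, score in first_relevance.get(keyword, {}).items():
            totals[category] = totals.get(category, 0.0) + weight * score
    return totals


# Hypothetical example data.
weights = {"espresso": 2.0, "grinder": 1.0}
first = {
    "espresso": {"coffee": 0.5, "tools": 0.0},
    "grinder": {"coffee": 0.25, "tools": 0.5},
}
scores = second_relevance(weights, first)
best = max(scores, key=scores.get)  # category with the highest score
```

Choosing the category with the highest score is one simple way to select from the candidates; other selection rules are equally compatible with the claim wording.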

2. The method according to claim 1, wherein extracting the keywords of the text to be processed comprises:

performing word segmentation on the text to be processed to obtain a plurality of first segmented words of the text to be processed;

removing, from the first segmented words, the first segmented words that belong to a target filter dictionary, to obtain one or more second segmented words, the second segmented words comprising the first segmented words remaining after the removal;

obtaining the keywords of the text to be processed according to the second segmented words;

wherein the target filter dictionary comprises a filter dictionary corresponding to a target data source, and the target data source comprises the data source to which the text to be processed belongs.

3. The method according to claim 2, wherein the target filter dictionary is constructed by:

performing word segmentation on each text belonging to the target data source to obtain a plurality of third segmented words;

determining a first ratio corresponding to each third segmented word, the first ratio corresponding to a third segmented word being the ratio of the number of texts in the target data source that contain the third segmented word to the total number of texts in the target data source;

constructing the target filter dictionary from the third segmented words whose first ratio exceeds a first ratio threshold.
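
A minimal sketch of the dictionary construction in claim 3, assuming the texts of the data source have already been segmented into token lists. The function name `build_filter_dictionary` and the sample texts are hypothetical.

```python
def build_filter_dictionary(tokenized_texts, ratio_threshold):
    """Return the tokens whose document ratio exceeds ratio_threshold.

    tokenized_texts: list of token lists, one per text of the data source.
    The first ratio of a token is the share of texts that contain it.
    """
    total = len(tokenized_texts)
    if total == 0:
        return set()
    doc_count = {}
    for tokens in tokenized_texts:
        for token in set(tokens):  # count each text at most once per token
            doc_count[token] = doc_count.get(token, 0) + 1
    return {t for t, c in doc_count.items() if c / total > ratio_threshold}


# Hypothetical data source: "buy" and "now" appear in 3 of 4 texts.
texts = [
    ["buy", "now", "shoes"],
    ["buy", "now", "hats"],
    ["buy", "socks"],
    ["read", "now"],
]
filter_words = build_filter_dictionary(texts, 0.5)
```

Words that occur in most texts of one data source tend to be source-specific boilerplate, which is why they are removed before keyword extraction.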

4. The method according to claim 3, wherein the first ratio threshold is determined by:

determining fourth segmented words from the third segmented words according to a current ratio threshold, the fourth segmented words comprising the third segmented words whose first ratio is equal to or greater than the current ratio threshold;

determining a remaining word count for each text belonging to the target data source, the remaining word count of a text being the number of third segmented words of the text that remain after the fourth segmented words are removed;

determining a second ratio, being the ratio of the number of texts whose remaining word count is equal to or greater than a word count threshold to the total number of texts belonging to the target data source;

when the second ratio is less than a second ratio threshold, determining the current ratio threshold as the first ratio threshold;

when the second ratio exceeds the second ratio threshold, updating the current ratio threshold according to a decrement value, and returning to the step of determining the fourth segmented words from the third segmented words according to the current ratio threshold.

5. The method according to claim 2, wherein obtaining the keywords of the text to be processed according to the second segmented words comprises:

combining the second segmented words to obtain fifth segmented words, each fifth segmented word comprising at least two consecutive, adjacent second segmented words;

determining sixth segmented words from the fifth segmented words, the sixth segmented words comprising the fifth segmented words that are existing entries;

determining seventh segmented words from the sixth segmented words, a seventh segmented word being a sixth segmented word that is not contained in any sixth segmented word other than itself;

obtaining the keywords of the text to be processed according to the seventh segmented words.
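
A sketch of the combination and filtering steps in claim 5. The set of existing entries (`known_entries`) stands in for whatever entry source is used, and treating "contained in" as substring containment is an assumption; the function name and sample data are hypothetical.

```python
def compound_keywords(tokens, known_entries, joiner=" "):
    """Build multi-word candidates and keep the maximal known entries.

    tokens: second segmented words, in text order.
    known_entries: set of strings regarded as existing entries.
    """
    # Fifth segmented words: every run of at least two adjacent tokens.
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 2, len(tokens) + 1):
            candidates.append(joiner.join(tokens[i:j]))
    # Sixth segmented words: candidates that are existing entries
    # (dict.fromkeys removes duplicates while keeping order).
    sixth = [c for c in dict.fromkeys(candidates) if c in known_entries]
    # Seventh segmented words: not contained in any other sixth word.
    return [c for c in sixth
            if not any(c != other and c in other for other in sixth)]


tokens = ["new", "york", "city", "marathon"]
entries = {"new york", "new york city", "city marathon"}
keywords = compound_keywords(tokens, entries)
```

Here "new york" is dropped because it is contained in the longer entry "new york city", so only the most specific entries survive. For Chinese text the joiner would typically be an empty string.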

6. The method according to claim 1, wherein obtaining the semantic description information corresponding to each keyword comprises:

obtaining web search information corresponding to each keyword, the web search information corresponding to a keyword being obtained by searching for the keyword through a web search service;

obtaining the semantic description information corresponding to each keyword according to the web search information corresponding to that keyword.

7. The method according to claim 6, wherein obtaining the web search information corresponding to each keyword comprises:

looking up, in a local information store, a candidate keyword corresponding to each keyword, the local information store recording the matching relationship between candidate keywords and candidate search information, the candidate search information being obtained by searching for the corresponding candidate keyword through the web search service;

when a candidate keyword corresponding to the keyword is found, obtaining the web search information corresponding to the keyword according to the candidate search information matched with the found candidate keyword;

when no candidate keyword corresponding to the keyword is found, calling the web search service to search for the keyword, to obtain the web search information corresponding to the keyword.
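
The lookup order in claim 7 amounts to a cache in front of the search service. In the sketch below the local information store is a plain dictionary and the search service is any callable; both, along with `get_search_info` and `fake_search`, are hypothetical stand-ins. Writing a fresh result back into the store is an assumption, since the claim does not state it.

```python
def get_search_info(keyword, local_store, search_service):
    """Return search information, preferring the local store."""
    if keyword in local_store:
        return local_store[keyword]   # candidate keyword found locally
    info = search_service(keyword)    # fall back to the search service
    local_store[keyword] = info       # assumption: remember for next time
    return info


calls = []

def fake_search(keyword):
    """Stand-in for a real web search service; records each call."""
    calls.append(keyword)
    return f"results for {keyword}"


store = {"espresso": "cached espresso summary"}
a = get_search_info("espresso", store, fake_search)
b = get_search_info("grinder", store, fake_search)
```

Only the keyword missing from the store triggers a call to the search service, which reduces repeated requests for frequently seen keywords.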

8. The method according to claim 1, further comprising:

obtaining a priority factor of each candidate category;

wherein determining, according to each second degree of correlation, the category to which the text to be processed belongs from the candidate categories comprises:

determining a fifth degree of correlation between the text to be processed and each candidate category according to each second degree of correlation and the priority factor of each candidate category;

determining, according to each fifth degree of correlation, the category to which the text to be processed belongs from the candidate categories.
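
One possible reading of claim 8 is sketched below. The claim does not say how the priority factor is applied, so multiplying it with the second degree of correlation is an assumption, as are the default factor of 1.0, the function name `pick_category`, and the sample values.

```python
def pick_category(second_relevance, priority):
    """Scale each category score by its priority factor and pick the best.

    second_relevance: {category: second degree of correlation}
    priority: {category: priority factor}; missing categories default to 1.0.
    Returns (chosen category, {category: fifth degree of correlation}).
    """
    fifth = {c: s * priority.get(c, 1.0) for c, s in second_relevance.items()}
    return max(fifth, key=fifth.get), fifth


second = {"coffee": 1.0, "tools": 0.75}
priority = {"coffee": 1.0, "tools": 2.0}
chosen, fifth = pick_category(second, priority)
```

In this example the priority factor changes the outcome: "coffee" has the higher second degree of correlation, but "tools" wins after scaling.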

9. The method according to any one of claims 1 to 8, wherein determining, according to each piece of semantic description information, the first degree of correlation between each keyword and each candidate category comprises:

determining a third degree of correlation between each piece of semantic description information and each candidate category according to each piece of semantic description information and the category name of each candidate category;

determining the first degree of correlation between each keyword and each candidate category according to each third degree of correlation.

10. The method according to claim 9, wherein determining, according to each piece of semantic description information and the category name of each candidate category, the third degree of correlation between each piece of semantic description information and each candidate category comprises:

determining, according to each piece of semantic description information and the category name of each candidate category, the shared words between each piece of semantic description information and the category name of each candidate category;

determining, from the shared words between each piece of semantic description information and the category name of each candidate category, the target shared words between each piece of semantic description information and the category name of each candidate category;

determining a third ratio, being the ratio of the total word length of the target shared words between each piece of semantic description information and the category name of each candidate category to the total word length of the category name of that candidate category;

determining the third degree of correlation between each piece of semantic description information and the category name of each candidate category according to each third ratio, a first term frequency of each target shared word in the corresponding semantic description information, and a first inverse document frequency of each target shared word;

wherein a target shared word between a piece of semantic description information and the category name of a candidate category is a shared word that is not contained in any other shared word between that semantic description information and that category name.
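
The target shared words and the third ratio of claim 10 can be sketched as follows, assuming the shared words have already been found. Measuring word length in characters (spaces included) and treating "contained in" as substring containment are assumptions; the function names and sample strings are hypothetical. The term frequency and inverse document frequency parts are omitted because the claim does not state how they are combined with the ratio.

```python
def target_shared_words(shared_words):
    """Keep only the shared words not contained in another shared word."""
    unique = list(dict.fromkeys(shared_words))  # de-duplicate, keep order
    return [w for w in unique
            if not any(w != other and w in other for other in unique)]


def third_ratio(shared_words, category_name):
    """Total length of target shared words over the category name length."""
    if not category_name:
        raise ValueError("category_name must not be empty")
    targets = target_shared_words(shared_words)
    return sum(len(w) for w in targets) / len(category_name)


shared = ["coffee", "coffee machines", "home"]
targets = target_shared_words(shared)
ratio = third_ratio(shared, "home coffee machines")
```

Dropping "coffee" in favour of "coffee machines" avoids counting the same characters of the category name twice.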

11. The method according to claim 9, further comprising:

determining a fourth degree of correlation between each piece of semantic description information and each candidate category according to each piece of semantic description information, the predetermined category-associated words of each candidate category, and the predetermined correlation coefficient between each predetermined category-associated word and its corresponding candidate category;

wherein determining, according to each third degree of correlation, the first degree of correlation between each keyword and each candidate category comprises:

determining the first degree of correlation between each keyword and each candidate category according to each third degree of correlation and each fourth degree of correlation.

12. The method according to claim 11, wherein determining the fourth degree of correlation between each piece of semantic description information and each candidate category according to each piece of semantic description information, the predetermined category-associated words of each candidate category, and the predetermined correlation coefficient between each predetermined category-associated word and its corresponding candidate category comprises:

determining the fourth degree of correlation between each piece of semantic description information and each candidate category according to a second term frequency of each predetermined category-associated word of each candidate category in each piece of semantic description information, a second inverse document frequency of each predetermined category-associated word of each candidate category, and the predetermined correlation coefficient between each predetermined category-associated word and its corresponding candidate category.
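
A sketch of the fourth degree of correlation for one piece of semantic description information and one candidate category. The claim names the three inputs but not the formula, so summing the product of term frequency, inverse document frequency, and coefficient is an assumption. The inverse document frequencies are taken as precomputed, and all names and values are hypothetical.

```python
def fourth_relevance(description_tokens, associated_words, idf):
    """Score a description against one category's associated words.

    description_tokens: tokens of the semantic description information.
    associated_words: {word: predetermined correlation coefficient}.
    idf: {word: precomputed inverse document frequency}.
    Assumption: score = sum of tf * idf * coefficient over the words.
    """
    total = len(description_tokens)
    if total == 0:
        return 0.0
    score = 0.0
    for word, coefficient in associated_words.items():
        tf = description_tokens.count(word) / total  # relative frequency
        score += tf * idf.get(word, 0.0) * coefficient
    return score


description = ["espresso", "machine", "with", "steam",
               "wand", "espresso", "maker", "review"]
associated = {"espresso": 1.0, "steam": 0.5}
idf = {"espresso": 2.0, "steam": 4.0}
score = fourth_relevance(description, associated, idf)
```

The correlation coefficient lets a curated list weight strongly indicative words more heavily than loosely related ones.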

13. An apparatus for determining a text category, comprising:

a first degree of correlation determining module, configured to determine, according to each piece of semantic description information, a first degree of correlation between each keyword and each candidate category;

a second degree of correlation determining module, configured to determine, according to the weight of each keyword and each first degree of correlation, a second degree of correlation between the text to be processed and each candidate category;

a text category determining module, configured to determine, according to each second degree of correlation, the category to which the text to be processed belongs from the candidate categories.

14. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.

15. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.