CN107491554A - Construction method and device for a text classifier, and text classification method - Google Patents

Construction method and device for a text classifier, and text classification method

Info

Publication number
CN107491554A
Authority
CN
China
Prior art keywords
text
expression
tree
keyword
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710779864.1A
Other languages
Chinese (zh)
Other versions
CN107491554B (en)
Inventor
李德彦
晋耀红
席丽娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co., Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd filed Critical Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201710779864.1A priority Critical patent/CN107491554B/en
Publication of CN107491554A publication Critical patent/CN107491554A/en
Application granted granted Critical
Publication of CN107491554B publication Critical patent/CN107491554B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a construction method for a text classifier, comprising the following steps: obtaining a classification taxonomy, storing the taxonomy in a multi-way tree data structure, and generating an ontology tree; extracting keywords from the ontology nodes of the ontology tree; obtaining ontology expressions, where an ontology expression is generated from a classification rule and a semantic model, the classification rule is generated from the keywords and logical operators, and the semantic model is generated from the keywords; and associating each ontology node with its corresponding ontology expressions to obtain a text classifier comprising the ontology tree and the ontology expressions associated with each of its ontology nodes. Classifying unknown text with a classifier constructed in this way makes it possible to classify texts with severe feature overlap accurately, while also avoiding the classification errors caused by unbalanced training corpora.

Description

Construction method and device for a text classifier, and text classification method
Technical field
The present application relates to the field of text mining technology, and in particular to a construction method for a text classifier. The application further relates to a construction device for a text classifier and to a text classification method.
Background technology
With the rapid development of Internet resources, texts of all kinds are growing quickly. Texts include structured and unstructured text, and the process of extracting information that is interesting or useful to a user from unstructured text is called text mining. Text classification is an important branch of text mining technology.
Common text classification mainly relies on statistical methods, including k-nearest neighbours, naive Bayes, neural networks and support vector machines. In statistical text classification, a pre-labelled training corpus is used to train a template for each category, and the templates are then used to classify unknown texts. When fine-grained classification is required, the corpora of different categories share identical features, a phenomenon referred to here as feature overlap (feature crossover). When feature overlap is severe, the precision of text classification drops markedly.
Summary of the invention
Existing text classifiers are not suited to texts with severe feature overlap. To solve this technical problem, in a first aspect the present application provides a construction method for a text classifier, comprising the following steps:
obtaining a classification taxonomy, storing the taxonomy in a multi-way tree data structure, and generating an ontology tree;
extracting keywords from the ontology nodes of the ontology tree;
obtaining ontology expressions, where an ontology expression is generated from a classification rule and a semantic model, the classification rule is generated from the keywords and logical operators, and the semantic model is generated from the keywords;
associating each ontology node with the corresponding ontology expressions to obtain a text classifier, the text classifier comprising the ontology tree and the ontology expressions respectively associated with each ontology node of the ontology tree.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting keywords from the ontology nodes of the ontology tree includes:
extracting subject words from the names of the ontology nodes;
obtaining expansion words from the subject words, and obtaining keywords comprising the subject words and the expansion words.
With reference to the first aspect and the above possible implementations, in a second possible implementation of the first aspect, the step of obtaining expansion words from the subject words includes:
segmenting preset sample texts to obtain first characters;
building an inverted index from the first characters to obtain an index database;
segmenting the subject words to obtain second characters;
matching the second characters against the index database;
calculating the relevance between each sample text and the subject word from the matching result;
displaying the sample texts whose relevance is greater than zero in descending order of relevance;
highlighting, in the displayed sample texts, the first characters that match the second characters;
obtaining expansion words from the characters in the displayed sample texts that partially match the subject word.
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, the method further includes: determining predicted classification labels for preset test texts using the text classifier;
when the accuracy is below a preset threshold, adjusting the ontology expressions in the text classifier, the accuracy being the proportion of predicted classification labels that match the original classification labels of the test texts out of the total number of predicted classification labels.
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of adjusting the ontology expressions in the classifier includes:
extracting the ontology expression corresponding to a predicted classification label that does not match the original classification label;
when the corresponding ontology expression lacks a constraint factor, adding a constraint factor to the ontology expression to obtain an optimized ontology expression, the constraint factor comprising a concept in a semantic model and/or a logical operator.
In a second aspect, the present application provides a text classification method, comprising the following steps:
obtaining a text to be classified;
determining the ontology expression in a text classifier that matches the text to be classified, where the text classifier comprises an ontology tree and the ontology expressions respectively associated with each ontology node of the ontology tree;
determining the ontology node associated with that ontology expression;
determining the category of the text to be classified from the information of that ontology node.
With reference to the second aspect, in a first possible implementation of the second aspect, the step of determining the ontology expression in the text classifier that matches the text to be classified includes:
when an ontology node is associated with more than one ontology expression, judging in parallel whether the text to be classified matches each ontology expression.
In a third aspect, the present application provides a text classifier construction device, comprising:
a first acquisition unit, configured to obtain a classification taxonomy, store the taxonomy in a multi-way tree data structure and generate an ontology tree;
an extraction unit, configured to extract keywords from the ontology nodes of the ontology tree;
a second acquisition unit, configured to obtain ontology expressions, where an ontology expression is generated from a classification rule and a semantic model, the classification rule is generated from the keywords and logical operators, and the semantic model is generated from the keywords;
a generation unit, configured to associate each ontology node with the corresponding ontology expressions to obtain a text classifier, the text classifier comprising the ontology tree and the ontology expressions respectively associated with each ontology node of the ontology tree.
With reference to the third aspect, in a first possible implementation of the third aspect, the extraction unit further includes:
a subject-word extraction sub-unit, configured to extract subject words from the names of the ontology nodes;
an expansion sub-unit, configured to obtain expansion words from the subject words and to obtain keywords comprising the subject words and the expansion words.
With reference to the third aspect and the above possible implementations, in a second possible implementation of the third aspect, the text classifier construction device further includes:
a test text classification unit, configured to determine predicted classification labels for preset test texts using the text classifier;
an optimization unit, configured to adjust the ontology expressions in the text classifier when the accuracy is below a preset threshold, the accuracy being the proportion of predicted classification labels that match the original classification labels of the test texts out of the total number of predicted classification labels.
In the text classifier construction method and text classification method of the above technical solution, an ontology tree is first generated; keywords are then extracted from the ontology nodes of the ontology tree; semantic models are generated from the keywords and classification rules are generated from the keywords and logical operators; ontology expressions are then generated from the semantic models and classification rules, and each constructed ontology expression is associated with its corresponding ontology node, so that the ontology tree together with all the ontology expressions associated with its nodes constitutes the text classifier. When the text classifier is used for classification, a text to be classified triggers a specific ontology expression; because that expression is associated with a specific ontology node, the triggered expression identifies the node. The information of that node, such as its name, is then used as the classification label for the text to be classified, determining its category.
Because an ontology expression contains at least one concept of a semantic model that can effectively characterize the text to be classified, and because, when several such concepts are present, identical or different logical relations hold between them, the ontology expressions associated with different ontology nodes differ even if the keywords extracted from those nodes happen to be identical. The method is therefore suited to classifying texts whose feature words overlap heavily.
At the same time, because a text's category is determined by triggering an ontology expression, there is no need to count covered features or compute feature weights. Even if the training corpus is unbalanced and some category has particularly few feature words, feature skew will not cause classification errors: once the feature words that characterize the text's semantics have been extracted and used to build an ontology expression, triggering that expression is enough to label the text, regardless of how many times the feature words occur or how they are weighted. The classification errors caused by unbalanced training corpora are thereby avoided.
Brief description of the drawings
To illustrate the technical solution of the present application more clearly, the drawings required by the embodiments are briefly introduced below. It should be apparent that persons of ordinary skill in the art can obtain further drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a first embodiment of the construction method for a text classifier of the present application;
Fig. 2 is a flow chart of step S200 in a second embodiment of the construction method;
Fig. 3 is a flow chart of step S220 in a third embodiment of the construction method;
Fig. 4 is a flow chart of a fourth embodiment of the construction method;
Fig. 5 is a flow chart of step S600 in a fifth embodiment of the construction method;
Fig. 6 is a flow chart of an embodiment of the text classification method of the present application;
Fig. 7 is a schematic structural diagram of a first embodiment of the construction device for a text classifier of the present application;
Fig. 8 is a schematic structural diagram of a second embodiment of the construction device;
Fig. 9 is a schematic structural diagram of a third embodiment of the construction device.
Detailed description of the embodiments
The embodiments of the present application are described in detail below with reference to the drawings.
Text classification means assigning a text to one or several categories under a given classification taxonomy. A text classifier is the general term for the method used to classify texts during text mining.
A classification taxonomy contains labels at several levels and reflects the specific classification needs of people in different application scenarios. Taking customer-service work-order texts from a bank's credit-card department as a concrete application scenario, the taxonomy may be as shown in Table 1, with first-level classification labels, second-level labels subordinate to the first-level labels, and third-level labels subordinate to the corresponding second-level labels. Besides the categories shown in Table 1, the taxonomy may also contain other first-level labels and the second-level labels subordinate to them, other third-level labels under the second-level labels, and labels at further levels in a similar way.
Table 1: Example classification taxonomy
Using the method for the text classification based on statistical method, following two defects at least be present.
First, when text classification requirement is fine grit classification, there is identical in the language material content between classification and classification Feature Words, that is, produce characteristic crossover phenomenon.
Again taking customer-service work-order texts from a bank's credit-card department as the application scenario, consider the following two texts to be classified.
Text to be classified 1:
A counterfeit report xw was previously filed: 20150503nxxxxx181. The customer has now called again, dissatisfied with the result, claiming that our bank must also bear responsibility; strongly dissatisfied, the customer has complained again. Please have your department verify and handle this as soon as possible, thank you! Contact number: 152xxxx4718.
Text to be classified 2:
The calling customer asks that form No. 20150916nxxxxx311 be expedited, requests prompt handling and notification of the result, states that no one has contacted them so far, requests a reduction of the loss, asks that the 4,900 yuan first be registered as a dispute, and is only willing to repay the amount of normal purchases, not the fraudulently charged amount. Please have your department verify and handle this as soon as possible, thank you! Contact number: 138xxxx8628.
In these two texts, the semantics expressed by text 1 relate to counterfeit and fraudulent charges, while the semantics expressed by text 2 relate to expediting a request. However, many feature words with identical or similar concepts appear in both texts. For example, the feature words "dissatisfied", "call" and "verify" occur in both; likewise, for a credit-card customer-service department, "counterfeit" in text 1 and "fraudulent charge" in text 2 can be regarded as similar feature words. In both texts, the feature words that effectively characterize the category the text actually belongs to are comparatively few, for example "counterfeit report" in the counterfeit/fraud text, and "requests prompt handling" and "no one has contacted them so far" in the expediting text.
When such texts are classified with a statistical method, the two texts above yield many identical or similar feature words, feature overlap is severe, and it is in practice difficult or impossible to extract effective feature words such as "requests prompt handling" or "no one has contacted them so far". Faced with training corpora of this kind, an automatically learned statistical classifier is prone to misjudgement, so it is very difficult to reach the desired precision.
Second, when the training corpus is unbalanced, some categories have a great deal of training material and yield many features with wide coverage, while other categories have very little material and yield too few features to cover all aspects of the category. Statistical classification then easily suffers from feature skew.
Again taking credit-card customer-service work orders as the scenario and continuing with text to be classified 2: feature words that effectively characterize the concept of expediting, such as "requests prompt handling" and "no one has contacted them so far", are difficult to extract, while the feature word "fraudulent charge" in "not willing to repay the fraudulently charged amount" is easy to extract. Thus in text 2 the genuinely characteristic feature words cannot be extracted while a misleading one is, which readily leads to misjudgement and classification errors.
Furthermore, suppose that when the text classifier is built the training corpus for the category "expedite handling" is very small, and only five feature words are extracted from it: "expedite", "fraudulent charge", "amount", "verify" and "handle". At the same time the corpus for the category "counterfeit and fraudulent charges" is large, and the feature words extracted from it have much wider coverage, for example fourteen feature words: "credit card", "amount", "limit increase", "credit limit", "fraudulent charge", "call", "dissatisfied", "verify", "handle", "responsibility", "repay", "complain", "progress" and "accept".
Faced with the following text to be classified 3, which actually belongs to the category "expedite handling", a statistical text classification method is likely to misjudge.
Text to be classified 3:
Previous complaint forms: 20150826j00000044, 20150902j00000248, 20150910j00000149. The customer states that they have been complaining by telephone since August 26: a credit-card limit-increase application was refused by a staff member, and no call back from the branch manager has been received so far. The customer demands that the branch handle the matter, complains about its handling capability, repeatedly asked for the telephone number of the supervisory authority during the call, and demands that, whatever the outcome, the branch report the processing progress. The customer says the processing time is too long and is unwilling to keep waiting. Please handle, thanks.
A statistical classification method determines the category from the number and weight of the extracted features. Classified statistically, text 3 would be labelled "counterfeit and fraudulent charges", because that category covers many more feature words, whereas "expedite handling" has few feature words, limited coverage, and cannot match the content of text 3 as well.
Existing text classifiers are thus not suited to texts with severe feature overlap, nor to unbalanced training corpora. To solve this technical problem, referring to Fig. 1, an embodiment of the present application provides a construction method for a text classifier, comprising the following steps:
S100: obtaining a classification taxonomy, storing the taxonomy in a multi-way tree data structure, and generating an ontology tree;
S200: extracting keywords from the ontology nodes of the ontology tree;
S300: obtaining ontology expressions, where an ontology expression is generated from a classification rule and a semantic model, the classification rule is generated from the keywords and logical operators, and the semantic model is generated from the keywords;
S400: associating each ontology node with the corresponding ontology expressions to obtain a text classifier, the text classifier comprising the ontology tree and the ontology expressions respectively associated with each ontology node of the ontology tree.
In step S100, the classification taxonomy may be built manually or by a computer; the application places no restriction on this. The step of "storing the taxonomy in a multi-way tree data structure" in step S100 may specifically be carried out as follows: first create a root node; with the root node as parent, add first-level ontology nodes named after the first-level classification labels of the taxonomy; similarly, with each first-level ontology node as parent, add second-level ontology nodes named after the corresponding second-level classification labels; and so on, until ontology nodes have been created for the classification labels of every level in the taxonomy. The ontology nodes at every level, together with the parent-child relations between them, constitute the ontology tree. First-level, second-level, third-level and further ontology nodes in the tree are all referred to simply as ontology nodes.
For example, continuing with the example of Table 1 and storing the taxonomy in a multi-way tree data structure, the generated ontology tree is as shown in Table 2.
Table 2: Example ontology tree
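As a concrete illustration, the following is a minimal Python sketch (not part of the patent) of storing such a taxonomy as a multi-way tree; the first- and second-level label names are hypothetical placeholders, since only the third-level labels are named in the text.

```python
class OntologyNode:
    """One node of the ontology tree; its name is a classification label."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.expressions = []   # ontology expressions are attached later (step S400)

    def add_child(self, name):
        child = OntologyNode(name, parent=self)
        self.children.append(child)
        return child


def build_ontology_tree(taxonomy):
    """taxonomy: {level-1 label: {level-2 label: [level-3 labels]}}."""
    root = OntologyNode("root")
    for l1, level2 in taxonomy.items():
        n1 = root.add_child(l1)
        for l2, level3 in level2.items():
            n2 = n1.add_child(l2)
            for l3 in level3:
                n2.add_child(l3)
    return root


# Hypothetical fragment of the credit-card work-order taxonomy; only the
# third-level labels appear in the text, the upper levels are placeholders.
tree = build_ontology_tree({
    "card security": {"risk events": ["counterfeit and fraudulent charges"]},
    "service requests": {"progress": ["expedite handling"]},
})
```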
Optionally, the step S200 of extracting keywords from the ontology nodes of the ontology tree may include: obtaining the name of an ontology node; segmenting the name of the ontology node into words to obtain subject words; and using these subject words as the keywords of that ontology node.
For example, continuing with the example of Table 2 and taking the third-level ontology node named "counterfeit and fraudulent charges" (伪冒盗刷) as an example, segmenting the node name yields the subject words "counterfeit" (伪冒) and "fraudulent charge" (盗刷). These are used as keywords in the next step, obtaining ontology expressions.
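A small sketch of this step, assuming the node names are Chinese and using the general-purpose segmenter jieba; the choice of segmenter and any custom dictionary are implementation assumptions, not part of the patent.

```python
import jieba  # a common Chinese segmenter; any word segmenter could be substituted

def extract_subject_words(node_name):
    """Segment an ontology node's name into subject words (step S200)."""
    return [w for w in jieba.lcut(node_name) if w.strip()]

# With a suitable user dictionary, segmenting the node name "伪冒盗刷"
# ("counterfeit and fraudulent charges") would yield the subject words
# "伪冒" (counterfeit) and "盗刷" (fraudulent charge).
subject_words = extract_subject_words("伪冒盗刷")
```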
Optionally, referring to Fig. 2, the step S200 of extracting keywords from the ontology nodes of the ontology tree may include:
S210: extracting subject words from the names of the ontology nodes;
S220: obtaining expansion words from the subject words, and obtaining keywords comprising the subject words and the expansion words.
By obtaining expansion words, further semantically close but implicit expansion words can be mined; the subject words and expansion words together serve as keywords for building the classification rules and semantic models, which improves the classification precision of the text classifier.
The method of extracting subject words in step S210 may be the same as the method of extracting subject words described in the foregoing implementation.
Referring to Fig. 3, the step S220 of obtaining expansion words from the subject words may include:
S221: segmenting preset sample texts to obtain first characters;
S222: building an inverted index from the first characters to obtain an index database;
S223: segmenting the subject words to obtain second characters;
S224: matching the second characters against the index database;
S225: calculating the relevance between each sample text and the subject word from the matching result;
S226: displaying the sample texts whose relevance is greater than zero in descending order of relevance;
S227: highlighting, in the displayed sample texts, the first characters that match the second characters;
S228: obtaining first expansion words from the characters in the displayed sample texts that partially match the subject word.
Steps S221 and S222 use specific sample texts to build an index database, from which the first expansion words are then obtained. For example, 10,000 customer-service work-order texts from the bank's credit-card department are obtained in advance as sample texts; these 10,000 sample texts are segmented at single-character granularity to obtain the first characters, and an inverted index is built character by character to form the index database.
In step S223 the subject words extracted in step S210 are segmented with the same segmentation method as in step S221 to obtain the second characters.
In step S224 the second characters are matched character by character against the inverted index in the index database. The more characters of a sample text that match, the higher the relevance between that sample text and the subject word is considered to be; specifically, the relevance can be computed from the length of the matched characters.
In step S228 the first characters that partially match the subject word in a sample text are extended forwards or backwards to obtain a character string at word level with complete meaning, and this string is taken as a first expansion word. Besides the partial-match case, characters that match the subject word completely may likewise be extended forwards or backwards in the sample to obtain a string with complete meaning as a first expansion word. This step may be performed manually or completed by a computer; the application places no restriction on this.
Steps S223 to S228 use the subject word as a search term, match it against the inverted index of the specific index database, compute the relevance between the sample texts and the subject word, sort and display the sample texts in descending order of relevance, and highlight in the displayed sample texts the first characters that match the second characters, giving a visual presentation that helps to locate first expansion words quickly. Especially when the sample texts are long, or when the number of sample texts with relevance greater than zero is large, obtaining the first expansion words manually from the highlighted characters in this way greatly improves efficiency and reduces the workload.
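A minimal sketch of the indexing and ranking part (steps S221 to S226), under the assumption that "single-character granularity" means indexing individual characters and that relevance is simply the number of matched characters; the display and highlighting steps are omitted.

```python
from collections import defaultdict

def build_char_index(sample_texts):
    """S221-S222: character-granularity inverted index over the sample texts."""
    index = defaultdict(set)
    for doc_id, text in enumerate(sample_texts):
        for ch in text:
            index[ch].add(doc_id)
    return index

def rank_by_relevance(subject_word, sample_texts, index):
    """S223-S226: match the subject word's characters against the index and
    rank the sample texts by the length of the matched characters."""
    matched_len = defaultdict(int)
    for ch in subject_word:               # S223: the subject word's characters
        for doc_id in index.get(ch, ()):  # S224: match against the index
            matched_len[doc_id] += 1      # S225: relevance ~ matched length
    ranked = sorted(matched_len.items(), key=lambda kv: kv[1], reverse=True)
    return [(sample_texts[i], score) for i, score in ranked if score > 0]  # S226
```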
For example, continuing with the example of Table 2, the subject words "counterfeit" and "fraudulent charge" are extracted. Using "counterfeit" and "fraudulent charge" as search terms against the index database built from the aforementioned 10,000 sample texts, matching content is located in the sample texts. Suppose the result shows three sample texts with relevance greater than 0, as follows.
Sample text 1:
A counterfeit report xw was previously filed: 20150503nxxxxx181. The customer has now called again, dissatisfied with the result, claiming that our bank must also bear responsibility; strongly dissatisfied, the customer has complained again. Please have your department verify and handle this as soon as possible, thank you! Contact number: 152xxxx4718.
Sample text 2:
The customer reports that they never applied for a card, yet transactions occurred, and refers to counterfeit report 20150207j11000092. During this period the customer called several times to press for action; see forms 20150209j23240075, 20150210j23240017 and 20150211j23240055. The customer has now called again, stating that after reporting the problem and pressing repeatedly, the branch's reply received on 2/11 merely asked whether the card had been applied for personally, with no reply at all about the result; the customer is very dissatisfied with the handling progress and demands a prompt reply with the final result, or a reply giving a definite deadline. Our online reassurance was ineffective; please have your department reply and handle this, thank you!
Sample text 3:
The customer complains that the card has been misused and has already filed forms 20150708j00000081 and 20150714j00000214. Dissatisfied with the current result, the customer further requires our bank to provide proof that the message in question was a counterfeit message, and asks for prompt handling. Please have your department assist, thank you!
The subject word "counterfeit" (伪冒) is matched completely in sample texts 1 and 2. Sample text 3, besides matching "counterfeit" completely, also partially matches "fraudulent charge" (盗刷) through the character "盗". Therefore "counterfeit" can be extended to the right to "counterfeit message" (伪冒短信), and "盗" can be extended to the right to "misuse" (盗用), so that "counterfeit message" and "misuse" are taken as first expansion words. The first expansion words and the subject words together serve as keywords for the next step, obtaining ontology expressions.
Besides obtaining first expansion words with the index database, second expansion words whose characters do not match but whose semantics are identical or similar to the subject words can also be obtained from the semantics of the subject words; the first expansion words, the second expansion words and the subject words together serve as keywords for the next step, obtaining ontology expressions.
For example, from sample text 2 above it can be seen that even if the word "counterfeit" did not appear, when "did not apply for a card" and "transaction" occur together in a text, its content still relates to counterfeit and fraudulent charges, and the user would still want the text classified under the category "counterfeit and fraudulent charges". Therefore "did not apply for a card" and "transaction" can be taken as second expansion words.
By obtaining second expansion words from the subject words, implicit expansion words can be mined further, which further improves the classification precision of the text classifier. The second expansion words may be obtained manually or by a computer; the application places no restriction on this.
In step S300 the ontology expressions may be generated manually or by a computer; the application places no restriction on this. Obtaining an ontology expression may consist of manually entering it into a computer, of that computer retrieving it, or of that computer receiving an ontology expression generated and sent by another computer; the application places no restriction on this either.
The generation of an ontology expression may specifically be realized through the following steps.
First, according to the keywords extracted in step S200, at least one keyword is connected using logical operators, so that logical associations exist between logical operators and keywords and between keywords, generating a classification rule.
The logical operators in the embodiments of the present application include: logical AND "+", logical NOT "-", logical OR "|", and parenthesised grouping "()". For example, the classification rule A+B requires that both A and B be present; the classification rule A+(B|C) requires that either B or C be present, and that A also be present.
Continuing with the "counterfeit and fraudulent charges" example from S200, the subject words extracted from the third-level ontology node named "counterfeit and fraudulent charges" are "counterfeit" and "fraudulent charge", the first expansion words obtained from the subject words are "counterfeit message" and "misuse", and the second expansion words are "did not apply for a card" and "transaction". With the subject words and the two kinds of expansion words together as keywords, three classification rules are generated:
Classification rule 1: counterfeit | fraudulent charge;
Classification rule 2: counterfeit message + misuse;
Classification rule 3: did not apply for a card + transaction.
The above classification rules may be built manually or generated by a computer; the application places no restriction on this.
Next, semantic models are generated from the keywords. A semantic model is, for a known concept, the set of semantic text expression forms that describe that concept, enumerated by induction from sample data.
Specifically, in one implementation a semantic model may contain either of two kinds of concepts, general language concepts and business factor concepts, marked with the symbols "c_" and "e_" respectively. The keywords are divided into general language concepts and business factor concepts, and the different expression forms of each known concept are enumerated from each keyword and from the contextual text information.
For example, continuing with the "counterfeit and fraudulent charges" example from S200, "negation" is treated as a general language concept, while "counterfeit", "fraudulent charge", "counterfeit message", "misuse", "card application" and "card use" are treated as business factor concepts, and the different text expression forms representing each concept are enumerated from the sample data, as shown in Table 3 below:
Table 3: Semantic model example one
Concept type | Concept | Different expression forms of the concept (feature words)
Business factor concept | e_counterfeit (伪冒) | counterfeit, impersonate, fake
Business factor concept | e_fraudulent charge (盗刷) | fraudulent charge
Business factor concept | e_counterfeit message (伪冒信息) | counterfeit SMS, counterfeit message, counterfeit call, counterfeit mail
Business factor concept | e_misuse (盗用) | misuse
Business factor concept | e_card application (办卡) | apply{0,2}card (办{0,2}卡)
Business factor concept | e_card use (用卡) | pay with the card, a transaction occurred, use the card, card
General language concept | c_negation | not, without, never
Besides the two kinds general language concept and business factor concept, a semantic model may also contain instant concepts. An instant concept is an ad-hoc concept set by the user on the spot according to actual needs and is marked with the symbol "k_". For example, if classification requires that "initial amount" appear in the text, an instant concept represented as "k_initial amount" can be defined directly, containing only the word "initial amount".
The above semantic models may be built manually or by a computer; the application places no restriction on this.
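One plausible way to hold such a semantic model in memory is a plain mapping from concept names to feature-word patterns; the concept names below are romanised translations and the data structure itself is an illustrative assumption, not the patent's format.

```python
# Concept names are romanised translations; the feature words are the Chinese
# expression forms of Table 3, written as regular-expression patterns.
SEMANTIC_MODEL = {
    "e_counterfeit":         ["伪冒", "冒充", "假冒"],
    "e_fraudulent_charge":   ["盗刷"],
    "e_counterfeit_message": ["伪冒短信", "伪冒消息", "伪冒来电", "伪冒邮件"],
    "e_misuse":              ["盗用"],
    "e_card_application":    ["办.{0,2}卡"],          # "apply{0,2}card" gap pattern
    "e_card_use":            ["用卡", "发生交易", "使用卡"],
    "c_negation":            ["不", "没有", "从未"],    # not, without, never
}
```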
Finally, the ontology expression is generated from the classification rule and the semantic model. Specifically, each keyword in the classification rule is mapped to the corresponding concept in the semantic model, the corresponding concepts are connected with the same logical operators as in the classification rule, and the ontology expression is generated.
For example, continuing with the "counterfeit and fraudulent charges" example, the following ontology expressions can be generated:
Ontology expression 1: e_counterfeit | e_fraudulent charge;
Ontology expression 2: e_counterfeit message + e_misuse;
Ontology expression 3: c_negation + e_card application + e_card use.
It should be noted that in step S300 the classification rules and semantic models may be generated simultaneously or one after the other; the application places no restriction on the order of generation.
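The following sketch shows one simplified way such expressions could be evaluated against a text, reusing the SEMANTIC_MODEL mapping from the sketch above; the left-to-right evaluation without operator precedence or parentheses is an assumption made for brevity, not the patent's matching engine.

```python
import re

def concept_matches(concept, text, model):
    """A concept matches when any of its feature-word patterns occurs in the text."""
    return any(re.search(pattern, text) for pattern in model.get(concept, []))

def eval_expression(expr, text, model):
    """Evaluate an ontology expression such as 'e_counterfeit | e_fraudulent_charge'
    left to right: '+' is AND, '|' is OR, '-' is AND-NOT; parentheses are omitted
    here for brevity."""
    tokens = re.findall(r"[-+|]|[^-+|\s]+", expr)
    result, op = None, "+"
    for tok in tokens:
        if tok in "+|-":
            op = tok
            continue
        hit = concept_matches(tok, text, model)
        if result is None:
            result = hit
        elif op == "+":
            result = result and hit
        elif op == "|":
            result = result or hit
        else:                     # '-' : the concept must NOT match
            result = result and not hit
    return bool(result)

# e.g. eval_expression("c_negation + e_card_application + e_card_use", text, SEMANTIC_MODEL)
```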
In step S400 the ontology expressions generated in step S300 on the basis of a given ontology node of step S200 are associated with that ontology node. One ontology node may be associated with one or more ontology expressions. Once all ontology nodes in the ontology tree have been associated with their respective ontology expressions, the ontology tree and the ontology expressions associated with each of its nodes together form the text classifier, which is used to classify unknown texts.
In the text classifier construction method and text classification method of the above embodiments, an ontology tree is first generated; keywords are then extracted from the ontology nodes of the ontology tree; semantic models are generated from the keywords and classification rules are generated from the keywords and logical operators; ontology expressions are then generated from the semantic models and classification rules, and each constructed ontology expression is associated with its corresponding ontology node, so that the ontology tree together with all the ontology expressions associated with its nodes constitutes the text classifier. When the text classifier is used for classification, a text to be classified triggers a specific ontology expression; because that expression is associated with a specific ontology node, the triggered expression identifies the node. The information of that node, such as its name, is then used as the classification label for the text to be classified, determining its category.
Because an ontology expression contains at least one concept of a semantic model that can effectively characterize the text to be classified, and because, when several such concepts are present, identical or different logical relations hold between them, the ontology expressions associated with different ontology nodes differ even if the keywords extracted from those nodes happen to be identical. The method is therefore suited to classifying texts whose feature words overlap heavily.
At the same time, because the above text classifier determines a text's category by triggering an ontology expression, there is no need to count covered features or compute feature weights. Even if the training corpus is unbalanced and some category has particularly few feature words, feature skew will not cause classification errors: once the feature words that characterize the text's semantics have been extracted and used to build an ontology expression, triggering that expression is enough to label the text, regardless of how many times the feature words occur or how they are weighted. The classification errors caused by unbalanced training corpora are thereby avoided.
For example, continuing with texts to be classified 1, 2 and 3 from the discussion of the shortcomings of statistical methods above: with the ontology expressions "k_counterfeit report" and "e_counterfeit + c_bear responsibility | c_dissatisfied" associated with the third-level ontology node named "counterfeit and fraudulent charges", text to be classified 1 triggers an expression in the text classifier and is classified into the category "counterfeit and fraudulent charges". Similarly, with the ontology expressions "e_expedite", "c_inquire + e_processing progress" and "e_no reply + e_processing time + c_long" associated with the third-level ontology node named "expedite handling", texts to be classified 2 and 3 are identified as belonging to the category "expedite handling".
The corresponding semantic model is shown in Table 4.
Table 4: Semantic model example two
It should be noted that in Table 4, "bear{0,2}responsibility" (承担{0,2}责任) means that when matching text, the text also matches if there are up to two additional characters between "bear" and "responsibility"; for example, text containing "bear a certain responsibility" or "bear all responsibility" is considered to match "bear{0,2}responsibility". Other similar notations in this application have the same meaning.
Likewise, "[^not]{0,5}dissatisfied" ([^不]{0,5}不满) means that when matching text, text with up to five characters preceding "dissatisfied" is matched, for example "very dissatisfied", while feature words with the opposite meaning such as "not dissatisfied" or "not very dissatisfied" are excluded. Other similar notations in this application have the same meaning.
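Since these feature-word patterns only make sense over the original Chinese text, the following sketch shows, as an assumption about how they could be implemented, a translation of the two notations into ordinary regular-expression matching plus a small window check for the negation case.

```python
import re

# "承担{0,2}责任" ("bear{0,2}responsibility"): up to two arbitrary characters may
# sit between the two parts and the feature word still matches.
bear_responsibility = re.compile("承担.{0,2}责任")
assert bear_responsibility.search("承担一切责任")      # "bear all responsibility"
assert not bear_responsibility.search("责任承担")      # wrong order: no match

# "[^不]{0,5}不满": "不满" (dissatisfied) matches only if no "不" (not) occurs in
# the five characters just before it, excluding reversed meanings such as
# "不是不满" ("not dissatisfied").
def matches_dissatisfied(text):
    for m in re.finditer("不满", text):
        window = text[max(0, m.start() - 5):m.start()]
        if "不" not in window:
            return True
    return False

assert matches_dissatisfied("客户非常不满")       # "the customer is very dissatisfied"
assert not matches_dissatisfied("客户不是不满")   # "the customer is not dissatisfied"
```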
Optionally, referring to Fig. 4, the construction method for the text classifier may further include:
S500: determining predicted classification labels for preset test texts using the text classifier;
S600: when the accuracy is below a preset threshold, adjusting the ontology expressions in the text classifier, the accuracy being the proportion of predicted classification labels that match the original classification labels of the test texts out of the total number of predicted classification labels.
In step S500 the preset test texts have been labelled manually with original classification labels; there is generally more than one test text, and the test texts belong to the same kind of text as the sample texts. For example, if the sample texts are customer-service work orders from the bank's credit-card department, the test texts are generally also customer-service work orders from that department.
In step S600, if the accuracy is greater than or equal to the preset threshold, the text classifier can classify unknown texts effectively; if it is below the threshold, the classifier is optimized by adjusting the ontology expressions.
Steps S500 and S600 may form an iterative process: through repeated optimization, the accuracy of the optimized text classifier can reach the threshold desired by the user.
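A minimal sketch of this evaluation loop; the classifier interface, the threshold value and the cap on iterations are illustrative assumptions.

```python
def accuracy(classifier, test_texts, original_labels):
    """S500: share of predicted labels that match the hand-assigned labels."""
    predicted = [classifier.classify(t) for t in test_texts]
    hits = sum(p == o for p, o in zip(predicted, original_labels))
    return hits / len(test_texts)

THRESHOLD = 0.9   # illustrative; the threshold is left to the user

def tune(classifier, test_texts, original_labels, adjust, max_rounds=20):
    """S600: adjust ontology expressions until the accuracy reaches the threshold."""
    for _ in range(max_rounds):
        if accuracy(classifier, test_texts, original_labels) >= THRESHOLD:
            break
        adjust(classifier)   # e.g. add missing constraint factors (steps S610-S620)
    return classifier
```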
Referring to Fig. 5, step S600 may specifically include:
S610: extracting the ontology expression corresponding to a predicted classification label that does not match the original classification label;
S620: when the corresponding ontology expression lacks a constraint factor, adding a constraint factor to the ontology expression to obtain an optimized ontology expression, the constraint factor comprising a concept in a semantic model and/or a logical operator.
Step S610 may specifically be realized as follows: first extract the predicted classification label that does not match the original classification label, find the ontology node whose name is identical to that predicted label, then determine the ontology expression associated with that node, and thereby locate the ontology expression that the test text triggered.
In step S620, when the ontology expression extracted in step S610 lacks a constraint factor, a constraint factor can be added to it, i.e. at least one business factor concept, general language concept or instant concept, together with a logical operator, is added to optimize the original ontology expression, so that a text to be classified that matched the original expression no longer matches the optimized one, or a text that could not match the original expression now matches the optimized one. For example, a new concept can be added to the semantic model together with a new logical operator to generate the optimized expression; feature words can be added to or removed from an existing concept; or logical operators can be added or removed to form new logical relations between concepts. The optimized ontology expression replaces the original one and is associated with the corresponding ontology node in the ontology tree, yielding the optimized text classifier.
For example, continuing with the "counterfeit and fraudulent charges" example, suppose the test texts include test text 1.
Test text 1:
The customer is dissatisfied with the inquiry result of a counterfeit report and wants to complain, referring to form No. 20150810s11000063. A new complaint form 20150906s00000076 has been filed online with risk management on the customer's behalf. However, the customer insists on going to the credit-card centre in Wuhan, Hubei, to resolve the matter in person. Please have your department verify and handle this as soon as possible, thanks. Contact number: 138xxxxx124.
Test text 1 matches the ontology expression "e_counterfeit | e_fraudulent charge", i.e. it triggers that expression; from the expression, the associated ontology node is determined in the ontology tree, and according to the node's name the test text is labelled with the predicted classification label "counterfeit and fraudulent charges". However, the actual semantics of test text 1 are not counterfeit or fraudulent charges but expediting a request, and the manually assigned original classification label is "expedite handling", which does not match the predicted label. If it is found that such a test text should only trigger the expression when "counterfeit" or "fraudulent charge" appears while "inquiry result" does not, then the current ontology expression lacks constraint factors: the logical operator "-" and the business factor concept "e_inquiry result", which contains the feature word "inquiry result". Adding the missing constraint factors to the ontology expression yields the optimized ontology expression: e_counterfeit | e_fraudulent charge - e_inquiry result. This optimized expression replaces the original one and is associated with the corresponding ontology node in the ontology tree, yielding the optimized text classifier.
Referring to Fig. 6, in another embodiment a text classification method is provided, comprising the following steps:
S710: obtaining a text to be classified;
S720: determining the ontology expression in a text classifier that matches the text to be classified, where the text classifier comprises an ontology tree and the ontology expressions respectively associated with each ontology node of the ontology tree;
S730: determining the ontology node associated with that ontology expression;
S740: determining the category of the text to be classified from the information of that ontology node.
In step S720 the ontology tree is stored in the form of a multi-way tree data structure. In an ontology tree, one ontology node may be associated with at least one ontology expression. When a node is associated with more than one ontology expression, the expressions form an ontology expression set; whether the text to be classified matches one of them can be judged by traversing the set one expression at a time, or the expressions can be matched in parallel, which improves matching speed and, especially when the number of texts to be classified is large, improves classification speed overall.
In step S740 the information of the ontology node may specifically be the node's name or similar. The name of the ontology node determined in S730 as associated with the triggered ontology expression is used as the classification label to mark the text to be classified, thereby classifying it. When the same text triggers more than one ontology expression and those expressions correspond to different ontology nodes, the names of the several nodes can each be used as classification labels to mark the same text, achieving multi-label classification.
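A sketch of steps S710 to S740, reusing the OntologyNode tree and the eval_expression function from the earlier sketches; the thread-based parallel matching is one possible reading of "judging in parallel", not a prescribed implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def classify(text, root, model):
    """S710-S740: return the names of all ontology nodes whose associated
    ontology expressions are triggered by the text (multiple labels are possible)."""
    labels = []
    stack = [root]
    while stack:                          # walk the ontology tree
        node = stack.pop()
        stack.extend(node.children)
        if not node.expressions:
            continue
        # S720: when a node carries several expressions, match them in parallel
        with ThreadPoolExecutor() as pool:
            hits = list(pool.map(lambda e: eval_expression(e, text, model),
                                 node.expressions))
        if any(hits):
            labels.append(node.name)      # S730-S740: node name becomes the label
    return labels
```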
Referring to Fig. 7, in another embodiment a text classifier construction device is provided, comprising:
a first acquisition unit 1, configured to obtain a classification taxonomy, store the taxonomy in a multi-way tree data structure and generate an ontology tree;
an extraction unit 2, configured to extract keywords from the ontology nodes of the ontology tree;
a second acquisition unit 3, configured to obtain ontology expressions, where an ontology expression is generated from a classification rule and a semantic model, the classification rule is generated from the keywords and logical operators, and the semantic model is generated from the keywords;
a generation unit 4, configured to associate each ontology node with the corresponding ontology expressions to obtain a text classifier, the text classifier comprising the ontology tree and the ontology expressions respectively associated with each ontology node of the ontology tree.
Optionally, referring to Fig. 8, the step of generating the ontology expressions may be carried out by an external computer or manually. In that case, after the extraction unit has extracted the keywords, it sends them out; the external computer, or a person, generates the classification rules and semantic models from the keywords and generates the ontology expressions from the semantic models and classification rules. The second acquisition unit then receives the externally supplied ontology expressions, and finally the generation unit constructs the text classifier. In this case the computational load of the classifier construction device itself can be reduced.
Optionally, referring to Fig. 9, the extraction unit 2 may include:
a subject-word extraction sub-unit 21, configured to extract subject words from the names of the ontology nodes;
an expansion sub-unit 22, configured to obtain expansion words from the subject words and to obtain keywords comprising the subject words and the expansion words.
By obtaining expansion words through the expansion sub-unit, further semantically close but implicit expansion words can be mined; the subject words and expansion words together serve as keywords for building the classification rules and semantic models, which improves the classification precision of the text classifier.
Optionally, referring to Fig. 9, the text classifier construction device may further include:
a test text classification unit 5, configured to determine predicted classification labels for preset test texts using the text classifier;
an optimization unit 6, configured to adjust the ontology expressions in the text classifier when the accuracy is below a preset threshold, the accuracy being the proportion of predicted classification labels that match the original classification labels of the test texts out of the total number of predicted classification labels.
Through repeated optimization, the accuracy of the optimized text classifier can reach the threshold desired by the user.
Optionally, the expansion sub-unit 22 may include:
a first segmentation unit, configured to segment preset sample texts to obtain first characters;
an index database construction unit, configured to build an inverted index from the first characters to obtain an index database;
a second segmentation unit, configured to segment the subject words to obtain second characters;
a matching unit, configured to match the second characters against the index database;
a relevance calculation unit, configured to calculate the relevance between each sample text and the subject word from the matching result;
a display unit, configured to display the sample texts whose relevance is greater than zero in descending order of relevance;
a highlighting unit, configured to highlight, in the displayed sample texts, the first characters that match the second characters;
a first expansion word acquisition unit, configured to obtain expansion words from the characters in the displayed sample texts that partially match the subject word.
Alternatively, the optimization unit 6 may include the following units (a sketch follows the list):
An ontology expression extraction unit, configured to extract the ontology expression corresponding to a predicted classification label that does not match the original classification label;
An adjustment unit, configured to add a constraint factor to the corresponding ontology expression when that expression lacks one, so as to obtain an optimized ontology expression, where the constraint factor includes a concept in the semantic model and/or a logical operator.
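For illustration, and assuming a purely textual expression form in which keywords and concepts are joined by AND/OR (the embodiment describes ontology expressions only abstractly), the adjustment step could look like this:

def add_constraint(expressions, node_title, constraint):
    """Narrow the ontology expression of a node whose predictions mismatched the
    original labels by AND-ing in a constraint factor (for example a concept
    from the semantic model)."""
    expr = expressions.get(node_title)
    if expr is not None and constraint not in expr:   # expression lacks this constraint
        expressions[node_title] = f"({expr}) AND {constraint}"
    return expressions

# Example: an over-broad "finance" expression can be constrained with a concept
# such as "stock-market" so that unrelated texts no longer match.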
Identical or similar parts of the embodiments in this specification may be referred to one another. The embodiments of the invention described above are not intended to limit the scope of the present invention.

Claims (10)

1. A construction method of a text classifier, characterized by comprising the following steps:
obtaining a taxonomy, storing the taxonomy in a multi-way tree data structure, and generating an ontology tree;
extracting keywords from the ontology nodes of the ontology tree;
obtaining ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
associating each ontology node with its corresponding ontology expression to obtain a text classifier, wherein the text classifier comprises the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
2. The construction method of a text classifier according to claim 1, characterized in that the step of extracting keywords from the ontology nodes of the ontology tree comprises:
extracting topic words from the title of an ontology node;
obtaining expansion words according to the topic words, and obtaining keywords that include the topic words and the expansion words.
3. The construction method of a text classifier according to claim 2, characterized in that the step of obtaining expansion words according to the topic words comprises:
segmenting preset sample texts to obtain first characters;
building an inverted index from the first characters to obtain an index database;
segmenting the topic words to obtain second characters;
matching the second characters against the index database;
calculating the relevance between each sample text and the topic words according to the matching result;
displaying the sample texts whose relevance is greater than zero in descending order of relevance;
highlighting, in the displayed sample texts, the first characters that match the second characters;
obtaining expansion words from the characters in the displayed sample texts that partially match the topic words.
4. The construction method of a text classifier according to claim 1, characterized by further comprising:
determining predicted classification labels for preset test texts using the text classifier;
adjusting the ontology expressions in the text classifier when the accuracy is below a preset threshold, wherein the accuracy is the ratio of the number of predicted classification labels that match the original classification labels of the test texts to the total number of predicted classification labels.
5. The construction method of a text classifier according to claim 4, characterized in that the step of adjusting the ontology expressions in the classifier comprises:
extracting the ontology expression corresponding to a predicted classification label that does not match the original classification label;
when the corresponding ontology expression lacks a constraint factor, adding a constraint factor to the ontology expression to obtain an optimized ontology expression, wherein the constraint factor includes a concept in the semantic model and/or a logical operator.
6. A text classification method, characterized by comprising the following steps:
obtaining a text to be classified;
determining the ontology expression in a text classifier that matches the text to be classified, wherein the text classifier comprises an ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree;
determining the ontology node associated with the matched ontology expression;
determining the category to which the text to be classified belongs according to the information of that ontology node.
7. The text classification method according to claim 6, characterized in that the step of determining the ontology expression in the text classifier that matches the text to be classified comprises:
when more than one ontology expression is associated with an ontology node, judging in parallel whether the text to be classified matches each of the ontology expressions.
8. A text classifier construction device, characterized by comprising:
a first acquisition unit, configured to obtain a taxonomy, store the taxonomy in a multi-way tree data structure, and generate an ontology tree;
an extraction unit, configured to extract keywords from the ontology nodes of the ontology tree;
a second acquisition unit, configured to obtain ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
a generation unit, configured to associate each ontology node with its corresponding ontology expression to obtain a text classifier, wherein the text classifier comprises the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
9. The text classifier construction device according to claim 8, characterized in that the extraction unit further comprises:
a topic-word extraction subunit, configured to extract topic words from the title of an ontology node;
an expansion subunit, configured to obtain expansion words according to the topic words, and obtain keywords that include the topic words and the expansion words.
10. The text classifier construction device according to claim 8, characterized by further comprising:
a test text classification unit, configured to determine predicted classification labels for preset test texts using the text classifier;
an optimization unit, configured to adjust the ontology expressions in the text classifier when the accuracy is below a preset threshold, wherein the accuracy is the ratio of the number of predicted classification labels that match the original classification labels of the test texts to the total number of predicted classification labels.
CN201710779864.1A 2017-09-01 2017-09-01 Construction method, construction device and the file classification method of text classifier Active CN107491554B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710779864.1A CN107491554B (en) 2017-09-01 2017-09-01 Construction method, construction device and the file classification method of text classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710779864.1A CN107491554B (en) 2017-09-01 2017-09-01 Construction method, construction device and the file classification method of text classifier

Publications (2)

Publication Number Publication Date
CN107491554A true CN107491554A (en) 2017-12-19
CN107491554B CN107491554B (en) 2018-12-04

Family

ID=60651158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710779864.1A Active CN107491554B (en) 2017-09-01 2017-09-01 Construction method, construction device and the file classification method of text classifier

Country Status (1)

Country Link
CN (1) CN107491554B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006053306A2 (en) * 2004-11-12 2006-05-18 Make Sence, Inc Knowledge discovery by constructing correlations using concepts or terms
WO2012115550A1 (en) * 2011-02-24 2012-08-30 Telefonaktiebolaget L M Ericsson (Publ) Method and server for media classification
CN105022733A (en) * 2014-04-18 2015-11-04 中科鼎富(北京)科技发展有限公司 DINFO-OEC text analysis mining method and device thereof
CN104598588A (en) * 2015-01-19 2015-05-06 河海大学 Automatic generating algorithm of microblog user label based on biclustering
CN106469145A (en) * 2016-09-30 2017-03-01 中科鼎富(北京)科技发展有限公司 Text emotion analysis method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIYING LIU ET AL.: "The Multi-language Knowledge Representation Based on Hierarchical Network of Concepts", Springer International Publishing Switzerland, 2015 *
JIN WEI: "Design and Implementation of an Ontology-Based Classification and Retrieval ***", China Master's Theses Full-text Database (Electronic Journals), Information Science and Technology Series *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549665A (en) * 2018-03-21 2018-09-18 上海蔚界信息科技有限公司 A kind of text classification scheme of human-computer interaction
CN110909150A (en) * 2018-09-13 2020-03-24 深圳市蓝灯鱼智能科技有限公司 Search result display method and device, storage medium and electronic device
CN109446065A (en) * 2018-09-18 2019-03-08 深圳壹账通智能科技有限公司 User tag test method, device, computer equipment and storage medium
CN109299272A (en) * 2018-10-31 2019-02-01 北京国信云服科技有限公司 A kind of large information capacity document representation method for neural network input
CN109299272B (en) * 2018-10-31 2021-07-30 北京国信云服科技有限公司 Large-information-quantity text representation method for neural network input
CN109545202B (en) * 2018-11-08 2021-05-11 广东小天才科技有限公司 Method and system for adjusting corpus with semantic logic confusion
CN109545202A (en) * 2018-11-08 2019-03-29 广东小天才科技有限公司 Method and system for adjusting corpus with semantic logic confusion
CN109684438A (en) * 2018-12-26 2019-04-26 成都科来软件有限公司 A method of data are retrieved with father and son's hierarchical structure
CN109684438B (en) * 2018-12-26 2020-11-13 成都科来软件有限公司 Method for retrieving data with parent-child hierarchical structure
CN109726290A (en) * 2018-12-29 2019-05-07 咪咕数字传媒有限公司 Complaint classification model determination method and device and computer-readable storage medium
CN109726290B (en) * 2018-12-29 2020-12-22 咪咕数字传媒有限公司 Complaint classification model determination method and device and computer-readable storage medium
CN110458412A (en) * 2019-07-16 2019-11-15 阿里巴巴集团控股有限公司 The generation method and device of risk monitoring and control data
CN110704619A (en) * 2019-09-24 2020-01-17 支付宝(杭州)信息技术有限公司 Text classification method and device and electronic equipment
CN112507170A (en) * 2020-12-01 2021-03-16 平安医疗健康管理股份有限公司 Data asset directory construction method based on intelligent decision and related equipment thereof
CN115599886A (en) * 2022-10-24 2023-01-13 广州广电运通信息科技有限公司(Cn) Method and equipment for generating search logic operator for Lucene and storage medium
CN116089614A (en) * 2023-01-12 2023-05-09 杭州瓴羊智能服务有限公司 Text marking method and device
CN116089614B (en) * 2023-01-12 2023-11-21 瓴羊智能科技有限公司 Text marking method and device
CN116383390A (en) * 2023-06-05 2023-07-04 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform
CN116383390B (en) * 2023-06-05 2023-08-08 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform

Also Published As

Publication number Publication date
CN107491554B (en) 2018-12-04

Similar Documents

Publication Publication Date Title
CN107491554B (en) Construction method, construction device and the file classification method of text classifier
Cho Tourism forecasting and its relationship with leading economic indicators
CN107861951A (en) Session subject identifying method in intelligent customer service
CN102737334B (en) Micro-segment definition system
CN108446813A (en) A kind of method of electric business service quality overall merit
KR102117796B1 (en) Intelligent customer service based on vector propagation model on click graph
CN109522556A (en) A kind of intension recognizing method and device
CN109241255A (en) A kind of intension recognizing method based on deep learning
CN107305557A (en) Content recommendation method and device
JP2002092305A (en) Score calculating method, and score providing method
CN104778186B (en) Merchandise items are mounted to the method and system of standardized product unit
CN108230010A (en) A kind of method and server for estimating ad conversion rates
CN106611344A (en) Method and device for mining potential customers
CN110363213A (en) The cognitive analysis of image of clothing and classification
CN113761218A (en) Entity linking method, device, equipment and storage medium
CN104657466B (en) A kind of user interest recognition methods and device based on forum postings feature
CN106202073A (en) Music recommends method and system
US20230140813A1 (en) System and method for auto-populating electronic transaction process
CN108763496A (en) A kind of sound state data fusion client segmentation algorithm based on grid and density
CN109740642A (en) Invoice category recognition methods, device, electronic equipment and readable storage medium storing program for executing
CN108052625A (en) A kind of entity sophisticated category method
CN107918825A (en) A kind of method and apparatus that age of user section is judged based on application installation preference
CN106529983A (en) Microblog opinion leader directional advertisement putting system
CN107203558A (en) Object recommendation method and apparatus, recommendation information treating method and apparatus
CN106897776A (en) A kind of continuous type latent structure method based on nominal attribute

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20171219

Assignee: Zhongke Dingfu (Beijing) Science and Technology Development Co., Ltd.

Assignor: Beijing Shenzhou Taiyue Software Co., Ltd.

Contract record no.: X2019990000215

Denomination of invention: Establishment method and device of text classifier and text classification method

Granted publication date: 20181204

License type: Exclusive License

Record date: 20191127

TR01 Transfer of patent right

Effective date of registration: 20200629

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Patentee after: Dingfu Intelligent Technology Co., Ltd

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Patentee before: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.