CN109241392A - Recognition method, device, system and storage medium for target words - Google Patents

Recognition method, device, system and storage medium for target words

Info

Publication number
CN109241392A
Authority
CN
China
Prior art keywords
candidate
string
feature value
word segment
target word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710538781.3A
Other languages
Chinese (zh)
Inventor
易鸣
汤俊杰
崔志刚
贺宇凯
王峰
李刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201710538781.3A
Publication of CN109241392A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of information processing, and in particular to a recognition method, device, system and storage medium for target words. A recognition method for target words provided by one embodiment of the present invention performs word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result; calculates at least one specified feature value for each candidate segment in the segmentation result; inputs the at least one specified feature value into a preset classifier to obtain a decision value of the candidate string corresponding to the specified feature values; and sets the candidate string whose decision value meets a preset condition as a target word. By computing at least one feature value of the candidate string and combining it with a classifier, the embodiments of this solution identify target words with markedly higher accuracy and recall than directly setting thresholds manually or using simple linear analysis, greatly reducing the cost of manual screening and improving the efficiency of target word recognition.

Description

Recognition method, device, system and storage medium for target words
[Technical Field]
The present invention relates to the field of information processing, and in particular to a recognition method, device, system and storage medium for target words.
[Background Art]
In recent years, with the rapid development of the Internet worldwide, the information people face has been growing exponentially. This information contains a large number of new words, such as film and television titles, brand names and popular Internet terms. How to discover new words accurately and quickly in an automated way has therefore become particularly important.
Existing new word discovery methods are mainly rule-based: according to the word-formation or surface characteristics of new words, a rule base, specialized dictionary or pattern base is built, and new words are then found by rule matching. The main drawbacks of rule-based methods are that they are confined to a particular domain, require rule bases and the like to be built, and suffer from insufficient recall, so they cannot identify target new words in today's complex Internet environment.
[Summary of the Invention]
One object of the present invention is to solve at least one of the above problems and to provide a recognition method, device, system and storage medium for target words.
To achieve this object, the present invention adopts the following technical solutions:
An embodiment of the present invention provides a recognition method for target words, comprising:
performing word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result;
calculating at least one specified feature value for each candidate segment in the segmentation result;
inputting the at least one specified feature value into a preset classifier to obtain a decision value of the candidate string corresponding to the specified feature values;
setting the candidate string whose decision value meets a preset condition as a target word.
Specifically, the step of performing word segmentation on the acquired candidate string based on text data of minimum granularity to obtain a segmentation result comprises:
performing word segmentation on the acquired candidate string based on text data of minimum granularity to obtain candidate segments;
combining the at least one candidate segment to obtain the segmentation result corresponding to the candidate string.
Further, the specified feature value includes at least one of the following:
basic features, statistical features, association coefficient features, and context features.
Specifically, the basic features include at least one of: the number of queries for the candidate string, the segmentation pattern of the candidate string, the maximum candidate segment length, and the ratio of the candidate segments' average search frequency to the number of queries for the candidate string.
Specifically, the statistical features include at least one of: tightness, joint probability, conditional probability, reverse conditional probability, pointwise mutual information, second-order pointwise mutual information, log-likelihood ratio, and normalized expectation.
Specifically, the association coefficient features include at least one of: probability ratio, increment, Jaccard distance, and Simpson distance.
Specifically, the context features include at least one of: context entropy, context diversity, left information entropy, right information entropy, left-adjacency diversity, and right-adjacency diversity.
Optionally, the preset classifier includes a gradient boosting decision tree classifier.
The decision value meeting the preset condition comprises: a decision value falling within a preset threshold range, the preset threshold range corresponding to the at least one specified feature value.
Further, the recognition method also includes obtaining the candidate string in a targeted manner through a search engine based on a preset extraction rule.
Further, the recognition method also includes performing data-cleaning preprocessing on the candidate string.
Another embodiment of the invention provides a recognition device for target words, comprising:
a word segmentation module, configured to perform word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result;
a feature value calculation module, configured to calculate at least one specified feature value for each candidate segment in the segmentation result;
a decision value calculation module, configured to input the at least one specified feature value into a preset classifier to obtain a decision value of the candidate string corresponding to the specified feature values;
a selection module, configured to set the candidate string whose decision value meets a preset condition as a target word.
Specifically, the word segmentation module performing word segmentation on the acquired candidate string based on text data of minimum granularity to obtain a segmentation result comprises:
performing word segmentation on the acquired candidate string based on text data of minimum granularity to obtain candidate segments;
combining the at least one candidate segment to obtain the segmentation result corresponding to the candidate string.
Specifically, the specified feature value includes at least one of the following:
basic features, statistical features, association coefficient features, and context features.
Specifically, the basic features include at least one of: the number of queries for the candidate string, the segmentation pattern of the candidate string, the maximum candidate segment length, and the ratio of the candidate segments' average search frequency to the number of queries for the candidate string.
Specifically, the statistical features include at least one of: tightness, joint probability, conditional probability, reverse conditional probability, pointwise mutual information, second-order pointwise mutual information, log-likelihood ratio, and normalized expectation.
Specifically, the association coefficient features include at least one of: probability ratio, increment, Jaccard distance, and Simpson distance.
Specifically, the context features include at least one of: context entropy, context diversity, left information entropy, right information entropy, left-adjacency diversity, and right-adjacency diversity. Optionally, the preset classifier includes a gradient boosting decision tree classifier.
Specifically, the decision value meeting the preset condition comprises: a decision value falling within a preset threshold range, the preset threshold range corresponding to the at least one specified feature value.
Further, the recognition device also includes a candidate string acquisition module,
the candidate string acquisition module being configured to obtain candidate strings in a targeted manner through a search engine based on a preset extraction rule.
Further, the candidate string acquisition module is also configured to perform data-cleaning preprocessing on the candidate string.
Yet another embodiment of the present invention provides a recognition system for target words, comprising:
a recognition device for target words, configured to perform word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result; calculate at least one specified feature value for each candidate segment in the segmentation result; input the at least one specified feature value into a preset classifier to obtain a decision value of the candidate string corresponding to the specified feature values; and set the candidate string whose decision value meets a preset condition as a target word;
a training sample recognition device, configured to provide the recognition device for target words with the training sample words needed to set the parameters of the classifier;
a target word collection device, configured to receive the target words identified by the recognition device for target words.
Another embodiment of the present invention provides a storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the system on which the storage medium resides is controlled to execute the steps of any of the preceding methods.
Compared with the prior art, the present invention has the following advantages:
A recognition method for target words provided by one embodiment of the present invention performs word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result; calculates at least one specified feature value for each candidate segment in the segmentation result; inputs the at least one specified feature value into a preset classifier to obtain the decision value of the candidate string corresponding to the specified feature values; and sets the candidate string whose decision value meets a preset condition as a target word. By computing at least one feature value of the candidate string and combining it with a classifier, the embodiments of this solution identify target words with markedly higher accuracy and recall than directly setting thresholds manually or using simple linear analysis, greatly reducing the cost of manual screening and improving the efficiency of target word recognition.
Additional aspects and advantages of the present invention will be set forth in part in the following description; they will become apparent from the description, or may be learned through practicing the embodiments of this solution.
[Brief Description of the Drawings]
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and the present invention is not limited thereto.
Fig. 1 is a flow diagram of one embodiment of the recognition method for target words of the present invention;
Fig. 2 is a flow diagram of one embodiment of the recognition method for target words of the present invention;
Fig. 3 is a structural schematic diagram of one embodiment of the recognition device for target words of the present invention;
Fig. 4 is a structural schematic diagram of one embodiment of the recognition device for target words of the present invention;
Fig. 5 is a structural schematic diagram of one embodiment of the recognition system for target words of the present invention;
Fig. 6 is a structural schematic diagram of one embodiment of the recognition system for target words of the present invention.
[Detailed Description of the Embodiments]
The present invention is further described below with reference to the accompanying drawings and exemplary embodiments, examples of which are shown in the drawings, where the same or similar reference numbers throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and cannot be construed as limiting the present invention. In addition, detailed descriptions of well-known technologies are omitted where they are unnecessary for showing the features of the present invention.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the" and "said" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present invention means that the stated features, integers, steps, operations, elements and/or components are present, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may also be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The term "and/or" used herein includes any and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in common dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art, and will not be interpreted in an idealized or overly formal sense unless specifically defined as such here.
Those skilled in the art will understand that concepts such as "server", "cloud" and "remote network device" used herein have equivalent effects and include, but are not limited to, a computer, a network host, a single network server, a cluster of multiple network servers, or a cloud formed by multiple servers. Here, a cloud is composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. In the embodiments of the present invention, communication between the remote network device, the recognition system device and the WNS server may be realized by any communication means, including but not limited to mobile communication based on 3GPP, LTE or WIMAX, computer network communication based on the TCP/IP or UDP protocols, and short-range wireless transmission based on Bluetooth or infrared transmission standards.
It should be noted that an embodiment of the present invention provides a recognition method for target words described from the perspective of a server; through programming, the recognition method may be implemented as a computer program running on a remote network device, including but not limited to a computer, a network host, a single network server, a cluster of multiple network servers, or a cloud formed by multiple servers.
Referring to Fig. 1, an exemplary embodiment of the recognition method for target words of the present invention specifically includes the following steps:
S11: performing word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result.
In an embodiment of the present invention, the text data of minimum granularity may be individual characters, i.e. the candidate string is segmented character by character; it may of course also be individual words. In a preferred embodiment of this solution, the most concise vocabulary units capable of expressing meaning are used as the text data of minimum granularity, and the candidate string is divided into multiple candidate segments. The segmentation may use a dictionary-based method or a statistics-based method; since the accuracy of the segmentation has a certain influence on the accuracy of the finally identified target words, a suitable segmentation method needs to be selected according to the actual situation.
To improve the accuracy of the recognition method for target words described in this solution, at least one specified feature value of the target word may be calculated. In one embodiment of this solution, after word segmentation is performed on the acquired candidate string based on the text data of minimum granularity to obtain candidate segments, the at least one candidate segment is combined to obtain the segmentation result corresponding to the candidate string. It can be understood that the segmentation result includes: the candidate string itself, the candidate segments obtained by dividing it with the text data of minimum granularity, and combinations of at least two candidate segments.
For example, suppose the acquired candidate string is the film title "我不是潘金莲" ("I Am Not Madame Bovary"). Segmenting it yields the candidate segments "我", "不是" and "潘金莲", and the corresponding segmentation result includes: "我", "不是", "潘金莲", "我不是", "不是潘金莲", "我潘金莲" and "我不是潘金莲".
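As an illustration of how such a segmentation result could be assembled, the following sketch enumerates the individual segments together with every in-order combination of two or more of them, matching the example above. It is a minimal sketch under stated assumptions: the segmenter itself is stubbed out with a fixed segmentation, since the choice of a dictionary-based or statistics-based segmenter is left open here.

```python
from itertools import combinations

def build_segmentation_result(segments):
    """Return the candidate segments plus every in-order combination of
    two or more segments (including the full candidate string itself)."""
    result = list(segments)
    n = len(segments)
    for size in range(2, n + 1):
        for idx in combinations(range(n), size):
            result.append("".join(segments[i] for i in idx))
    return result

# Assumed, fixed segmentation of the film title used in the example above.
print(build_segmentation_result(["我", "不是", "潘金莲"]))
# yields the seven strings listed in the example above
```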
Further, it can be understood that before word segmentation is performed on the acquired candidate string based on the text data of minimum granularity to obtain a segmentation result, multiple candidate strings also need to be acquired. Referring to Fig. 2, in one embodiment of this solution, the recognition method for target words further includes the step of:
S101: obtaining candidate strings in a targeted manner through a search engine based on a preset extraction rule.
It can be understood that the candidate strings this embodiment wishes to obtain appear relatively frequently in certain Internet corpora and relatively rarely in others; therefore, this embodiment needs to obtain candidate strings in a targeted manner based on a preset extraction rule. Targeted acquisition improves the quality of the candidate strings and prevents words that do not actually belong among the candidate strings from being mixed in because the corpus is not clean.
For example, in an exemplary embodiment of the present invention, a targeted web crawler is used to extract candidate strings from BBS forums, blogs or trusted user dictionaries, such as the user new-word upload function provided on the official Sogou input method homepage. The choice of sites may be a list of specified sites, or may be based on category filtering of the extracted web page content. Further, in one embodiment of the present invention, candidate strings may be obtained according to preset extraction rules such as the frequency of occurrence in the corpus and contextual features such as punctuation and length. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
In yet another embodiment of the present invention, in order to improve the quality of the selected candidate strings and thereby improve the accuracy and efficiency of identifying target words in this embodiment, after the candidate strings are obtained in a targeted manner through the search engine based on the preset extraction rule, the method further includes the step of:
S102: performing data-cleaning preprocessing on the candidate strings.
Specifically, in an exemplary embodiment of this solution, invalid information such as formatting-related HTML tags may be removed from web pages in advance, or certain fixed-format information may be removed in advance from the content of BBS pages.
Further, referring to Fig. 1, the recognition method for target words of the present invention further includes the step of:
S12: calculating at least one specified feature value for each candidate segment in the segmentation result.
Specifically, to improve the accuracy and recall of the recognition method for target words, the present invention evaluates the candidate string with specified feature values of multiple categories. One embodiment of the invention considers four categories of feature values: basic features, statistical features, association coefficient features and context features.
In one embodiment of the application, when the segmentation pattern of a candidate string is calculated, the pattern is determined from the number of characters in each candidate segment of the candidate string's segmentation result. Here "1+1+1" indicates that the segmentation result consists of 3 segments of 1 character each, and "3+2" indicates 2 segments of 3 and 2 characters respectively; for example, the segmentation pattern of the candidate string "我/不是/潘金莲" is 1-2-3, and its maximum candidate segment length is 3.
Specifically, the basic features include at least the number of queries for the candidate string, the segmentation pattern of the candidate string, the maximum candidate segment length, and the ratio of the candidate segments' average search frequency to the number of queries for the candidate string.
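A minimal sketch of how these basic features might be derived from a segmented candidate string and query-log counts is shown below; the counts passed in are hypothetical, and taking the natural logarithm of the query count follows the logged values quoted in the examples later in this description.

```python
import math

def basic_features(segments, string_query_count, segment_query_counts):
    """Illustrative basic features of a candidate string.

    segments             -- candidate segments, e.g. ["我", "不是", "潘金莲"]
    string_query_count   -- how often the whole candidate string was queried (hypothetical)
    segment_query_counts -- query counts of the individual segments (hypothetical)
    """
    pattern = "-".join(str(len(seg)) for seg in segments)      # e.g. "1-2-3"
    max_segment_length = max(len(seg) for seg in segments)     # e.g. 3
    avg_segment_freq = sum(segment_query_counts) / len(segment_query_counts)
    return {
        "log_query_count": math.log(string_query_count),
        "segmentation_pattern": pattern,
        "max_segment_length": max_segment_length,
        # ratio of the segments' average search frequency to the string's query count
        "avg_segment_freq_ratio": avg_segment_freq / string_query_count,
    }

# Hypothetical counts for the running example.
print(basic_features(["我", "不是", "潘金莲"],
                     string_query_count=50_000_000,
                     segment_query_counts=[900_000_000, 450_000_000, 80_000_000]))
```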
Specifically, the statistical features represent statistical information about the candidate string and include at least tightness, joint probability, conditional probability, reverse conditional probability, pointwise mutual information, second-order pointwise mutual information, log-likelihood ratio and normalized expectation. Likewise, in one embodiment of the application, statistical features such as the joint probability, pointwise mutual information and significance are computed between two consecutive candidate segments of the candidate string, and each specified feature value can be obtained from its calculation formula. If the candidate string contains more than two candidate segments, the specified feature value is first computed for each pair of adjacent segments, and the arithmetic mean, geometric mean, maximum and minimum of these pairwise values can then be taken as the specified feature values corresponding to the candidate string.
Specifically, the association coefficient features represent the association between the candidate segments of the candidate string and are computed in a similar way to the statistical features; they include the probability ratio, increment, Jaccard distance, Simpson distance, Kosgen measure, Piatetsky-Shapiro measure and so on. For example, in one embodiment of the application, when the Jaccard distance and the odds ratio (OR) are calculated, let x and y denote the two adjacent candidate segments of a candidate string; xy denotes that both segments occur together, x̄y that y occurs without x, xȳ that x occurs without y, and x̄ȳ that neither occurs. The counts of these four cases are computed separately and then substituted into the calculation formula of the Jaccard distance or the odds ratio. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
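For the pairwise measures just described, the sketch below computes the joint probability, pointwise mutual information, Jaccard distance and odds ratio for two adjacent segments from the four co-occurrence counts. The description does not spell out the exact formulas it relies on, so these are standard textbook definitions used as stand-ins, and the counts in the example are hypothetical.

```python
import math

def pair_association(n_xy, n_x_only, n_y_only, n_neither):
    """Association measures for adjacent segments x and y, from the counts of
    the four co-occurrence cases: both, x without y, y without x, neither."""
    total = n_xy + n_x_only + n_y_only + n_neither
    p_xy = n_xy / total
    p_x = (n_xy + n_x_only) / total
    p_y = (n_xy + n_y_only) / total
    return {
        "joint_probability": p_xy,
        "pmi": math.log(p_xy / (p_x * p_y)),                 # pointwise mutual information
        "jaccard_distance": 1.0 - n_xy / (n_xy + n_x_only + n_y_only),
        "odds_ratio": (n_xy * n_neither) / (n_x_only * n_y_only),
    }

# Hypothetical query-log counts for the segment pair ("不是", "潘金莲").
print(pair_association(n_xy=12_000, n_x_only=3_000_000, n_y_only=8_000, n_neither=96_980_000))
```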
Specifically, the context features represent the relationship between the candidate string and its context and include at least the context entropy, context diversity, left information entropy, right information entropy, left-adjacency diversity, right-adjacency diversity, and the cosine distance between the TF-IDF vectors of the candidate string and of its context. Here, the context entropy represents the amount of information contained in the variation of the candidate string's context in queries; the context diversity represents how many distinct words appear in the candidate string's context in queries; and the left-adjacency diversity represents how many distinct words appear immediately to the left of the candidate string in queries. The corresponding specified feature values can be obtained from the calculation formula of each specified feature value.
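As a sketch of the adjacency-based context features, the code below counts the distinct immediate left and right neighbours of a candidate string in a small set of queries and computes their Shannon entropy; treating neighbours at the character level and using Shannon entropy are assumptions made for illustration, since the features are only described informally here.

```python
import math
from collections import Counter

def context_features(candidate, queries):
    """Left/right adjacency diversity and information entropy of a candidate
    string over the queries in which it appears (character-level neighbours)."""
    left, right = Counter(), Counter()
    for query in queries:
        pos = query.find(candidate)
        while pos != -1:
            if pos > 0:
                left[query[pos - 1]] += 1             # character just left of the candidate
            end = pos + len(candidate)
            if end < len(query):
                right[query[end]] += 1                # character just right of the candidate
            pos = query.find(candidate, pos + 1)

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0

    return {
        "left_diversity": len(left),                  # distinct left neighbours
        "right_diversity": len(right),                # distinct right neighbours
        "left_entropy": entropy(left),
        "right_entropy": entropy(right),
    }

# Hypothetical query log containing the candidate string.
queries = ["我不是潘金莲影评", "电影我不是潘金莲", "我不是潘金莲结局", "看我不是潘金莲"]
print(context_features("我不是潘金莲", queries))
```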
Specifically, suppose that in one embodiment of this solution only a few feature values from one of the basic, statistical, association coefficient or context categories are calculated, without considering the other three categories. Target words can still be identified, but the accuracy and recall are far lower than when several feature values of at least two of these categories are selected.
It can be understood that several feature values from the basic, statistical, association coefficient and context features can be chosen according to the actual situation. In an exemplary embodiment of this solution, ten specified feature values were used, including tightness, candidate string segmentation pattern, joint probability, maximum candidate segment length, context diversity, significance, number of candidate string queries, left-adjacency diversity, Jaccard distance and odds ratio, and gave good decision results. It can be understood that in other embodiments of this solution, one or more of these ten specified feature values may also be selected and still give good decision results. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
Further, referring to Fig. 1, the recognition method for target words of the present invention further includes the step of:
S13: inputting the specified feature values into a preset classifier to obtain the decision value of the candidate string corresponding to the specified feature values.
It can be understood that, to improve the accuracy and recall of the recognition method for target words described in this solution, one embodiment of this solution uses a non-linear classifier rather than a simple linear classifier with a directly set threshold. For example, a gradient boosting decision tree (GBDT) classification algorithm or a deep neural network (DNN) classification algorithm may be used; a deep neural network of course requires a sufficient number of training sample words, and which non-linear classification algorithm to use can be chosen according to the actual situation. In an exemplary embodiment of the present invention, this embodiment is realized with the GBDT classification algorithm, which is built from a number of decision trees.
Further, it can be understood that the parameters of the GBDT must first be tuned. Specifically, in one embodiment of this solution, before the specified feature values are input into the preset classifier to obtain the decision value of the candidate string corresponding to the specified feature values, the method further includes:
adjusting the parameters of the preset classifier according to the decision values obtained after the at least one specified feature value of the sample words in a training sample word set is input into the preset classifier, where the training sample word set is a set of already-identified sample words.
Specifically, the parameters of the classifier include the minimum number of samples required at a leaf node of a decision tree, the maximum tree depth, the number of decision trees, the number of specified feature values used for classification, and the preset threshold range for deciding whether a candidate string is a target word. To avoid overfitting, the minimum number of samples must not be too small, the maximum depth should not be too deep, and the number of specified feature values used for classification should not be too large; of course, too large a minimum sample count may instead cause underfitting. In an example embodiment of this solution, the number of specified feature values used for classification is preferably the square root of the total number of specified feature values; in practice, different values should be tried, up to at most 30%-40% of the total number of specified feature values, which avoids overfitting.
Specifically, the decision value obtained after the at least one specified feature value of a sample word is input into the preset classifier is compared with the known recognition result of that sample word; if the decision value differs from the recognition result of the sample word, the parameters of the classifier are adjusted and the above process is repeated until the output of the preset classifier is consistent with the recognition results.
Further, after the optimal parameters of the GBDT classifier have been obtained, the at least one specified feature value of a candidate string is input into the preset classifier and the decision value of the candidate string corresponding to the specified feature values is calculated.
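Assuming an off-the-shelf gradient boosting implementation such as scikit-learn's GradientBoostingClassifier stands in for the GBDT classifier described above, the parameter tuning and scoring could be sketched as follows; the training matrix and labels are placeholders for the labelled training sample words, and the max_features values of "sqrt", 0.3 and 0.4 reflect the square-root and 30%-40% heuristics from the preceding paragraph.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder training data: one row per labelled sample word, one column per
# specified feature value (tightness, joint probability, query count, ...).
rng = np.random.default_rng(0)
X_train = rng.random((500, 10))
y_train = rng.integers(0, 2, size=500)      # 1 = target word, 0 = not a target word

# Search over the parameters named in the text: tree depth, number of trees,
# minimum samples per leaf node, and features considered per split.
param_grid = {
    "max_depth": [3, 4, 5],
    "n_estimators": [100, 200],
    "min_samples_leaf": [20, 50],
    "max_features": ["sqrt", 0.3, 0.4],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
gbdt = search.best_estimator_

# Decision value for a new candidate string: the predicted probability of the
# "target word" class plays the role of the classifier score.
x_candidate = rng.random((1, 10))
decision_value = gbdt.predict_proba(x_candidate)[0, 1]
print(search.best_params_, decision_value)
```

Grid search here takes the place of the manual compare-and-adjust loop described above; either way, the parameters are accepted once the classifier's outputs agree with the known recognition results.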
Further, referring to Fig. 1, the recognition method for target words of the present invention further includes the step of:
S14: setting the candidate string whose decision value meets the preset condition as a target word.
Specifically, in one embodiment of the present invention, the preset condition may be that the decision value falls within a preset threshold range, where the preset threshold range may be obtained through training on a large number of sample words in the preceding steps. It can be understood that when different specified feature values corresponding to different feature categories are selected and input into the preset classifier, the resulting decision values differ, and the preset threshold range corresponding to the decision value differs accordingly; that is, the preset threshold range corresponds to the selected specified feature values. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
For example, in one application scenario of this solution, the implementation of this embodiment is described using the film title "我/不是/潘金莲" as an example. First, more than ten specified features are extracted to obtain the specified feature values of the candidate string "我/不是/潘金莲", such as the number of queries for the candidate string (logarithm): 17.835, tightness: 0.995, adjacency diversity: 1287, and so on. These specified feature values are passed through a GBDT classifier, giving a decision value of 0.868783 for whether the candidate string is a target word. The minimum threshold obtained by training on a large number of sample words is 0.8; since the classifier score 0.868783 is greater than the classification threshold 0.8, "我不是潘金莲" is judged to be a target word meeting the preset condition.
Similarly, in another application scenario of this solution, the implementation of this embodiment is described using another candidate string, "电影/我/不是/潘金莲" ("the film / I / am not / Pan Jinlian"). First, more than ten specified features are extracted to obtain the corresponding specified feature values of this candidate string, such as the number of queries for the candidate string (logarithm): 11.644, tightness: 0.679, adjacency diversity: 81, and so on. These specified feature values are passed through a GBDT classifier, giving a decision value of 0.008744 for whether the candidate string is a target word. The minimum threshold obtained by training on a large number of sample words is 0.8; since the classifier score 0.008744 is less than the classification threshold 0.8, "电影我不是潘金莲" is judged not to be a target word.
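The final selection step applied in these two scenarios can be reduced to a simple comparison, as in the sketch below; the assumption is that the classifier outputs the decision values quoted above and that 0.8 is the trained minimum threshold.

```python
DECISION_THRESHOLD = 0.8   # minimum threshold obtained from the training sample words

def is_target_word(decision_value, threshold=DECISION_THRESHOLD):
    """A candidate string is kept as a target word when its decision value
    meets the preset condition, taken here to be value >= threshold."""
    return decision_value >= threshold

# Decision values quoted in the two application scenarios above.
print(is_target_word(0.868783))   # "我不是潘金莲"     -> True  (target word)
print(is_target_word(0.008744))   # "电影我不是潘金莲" -> False (not a target word)
```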
In conclusion a kind of recognition methods of target word provided by one embodiment of the present invention, the text based on minimum particle size Notebook data carries out participle division to the candidate string of acquisition, obtains word segmentation result;Calculate each candidate participle in the word segmentation result At least one specific characteristic value;At least one described specific characteristic value is inputted into default classifier, obtains the specific characteristic value The decision content of corresponding candidate string;The string of candidate corresponding to the decision content for meeting preset condition is set as target word.This programme Embodiment identifies target word by calculating at least one characteristic value and the combining classification device of candidate string, relative to directly artificially setting Determine threshold value and simple linear analysis method, accuracy rate has with recall rate significantly to be promoted, artificial screening cost is largely reduced, Improve the recognition efficiency of target word.
Further, following the functional modularization approach of computer software, an embodiment of the present invention also provides a recognition device for target words. Referring to Fig. 3, the recognition device includes a word segmentation module 11, a feature value calculation module 12, a decision value calculation module 13 and a selection module 14, which together form the framework of the recognition device and realize a modular embodiment. The specific functions realized by each module are disclosed in detail below.
The word segmentation module 11 is configured to perform word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result.
In an embodiment of the present invention, the text data of minimum granularity may be individual characters, i.e. the candidate string is segmented character by character; it may of course also be individual words. In a preferred embodiment of this solution, the word segmentation module 11 uses the most concise vocabulary units capable of expressing meaning as the text data of minimum granularity and divides the candidate string into multiple candidate segments. The segmentation may use a dictionary-based method or a statistics-based method; since the accuracy of the segmentation has a certain influence on the accuracy of the finally identified target words, a suitable segmentation method needs to be selected according to the actual situation.
To improve the accuracy of the recognition of target words described in this solution, at least one specified feature value of the target word may be calculated. In one embodiment of this solution, after the word segmentation module 11 performs word segmentation on the acquired candidate string based on the text data of minimum granularity to obtain candidate segments, it combines the at least one candidate segment to obtain the segmentation result corresponding to the candidate string. It can be understood that the segmentation result includes: the candidate string itself, the candidate segments obtained by dividing it with the text data of minimum granularity, and combinations of at least two candidate segments.
For example, suppose the acquired candidate string is the film title "我不是潘金莲". The word segmentation module 11 segments it into the candidate segments "我", "不是" and "潘金莲", and the corresponding segmentation result includes: "我", "不是", "潘金莲", "我不是", "不是潘金莲", "我潘金莲" and "我不是潘金莲".
Further, it can be understood that before the word segmentation module 11 performs word segmentation on the acquired candidate string based on the text data of minimum granularity to obtain a segmentation result, multiple candidate strings also need to be acquired. Referring to Fig. 4, in one embodiment of this solution, the recognition device for target words further includes:
a candidate string acquisition module 10, configured to obtain candidate strings in a targeted manner through a search engine based on a preset extraction rule.
It can be understood that the candidate strings this embodiment wishes to obtain appear relatively frequently in certain Internet corpora and relatively rarely in others; therefore, the candidate string acquisition module 10 described in this embodiment needs to obtain candidate strings in a targeted manner based on a preset extraction rule. Targeted acquisition improves the quality of the candidate strings and prevents words that do not actually belong among the candidate strings from being mixed in because the corpus is not clean.
For example, in an exemplary embodiment of the present invention, the candidate string acquisition module 10 uses a targeted web crawler to extract candidate strings from BBS forums, blogs or trusted user dictionaries, such as the user new-word upload function provided on the official Sogou input method homepage. The choice of sites may be a list of specified sites, or may be based on category filtering of the extracted web page content. Further, in one embodiment of the present invention, the candidate string acquisition module 10 may obtain candidate strings according to preset extraction rules such as the frequency of occurrence in the corpus and contextual features such as punctuation and length. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
In yet another embodiment of the present invention, in order to improve the quality of the selected candidate strings and thereby improve the accuracy and efficiency of identifying target words in this embodiment, after the candidate string acquisition module 10 obtains candidate strings in a targeted manner through the search engine based on the preset extraction rule, it performs data-cleaning preprocessing on the candidate strings.
Specifically, in an exemplary embodiment of this solution, the candidate string acquisition module 10 may remove invalid information such as formatting-related HTML tags from web pages in advance, or remove certain fixed-format information in advance from the content of BBS pages.
Further, referring to Fig. 3, the feature value calculation module 12 is configured to calculate at least one specified feature value for each candidate segment in the segmentation result.
Specifically, to improve the recognition accuracy and recall for target words, the feature value calculation module 12 evaluates the candidate string with specified feature values of multiple categories. It can be understood that commonly used feature values fall into four categories: basic features, statistical features, association coefficient features and context features.
In one embodiment of the application, when the feature value calculation module 12 calculates the segmentation pattern of a candidate string, the pattern is determined from the number of characters in each candidate segment of the candidate string's segmentation result. Here "1+1+1" indicates that the segmentation result consists of 3 segments of 1 character each, and "3+2" indicates 2 segments of 3 and 2 characters respectively; for example, the segmentation pattern of the candidate string "我/不是/潘金莲" is 1-2-3, and its maximum candidate segment length is 3.
Specifically, the basic features include at least the number of queries for the candidate string, the segmentation pattern of the candidate string, the maximum candidate segment length, and the ratio of the candidate segments' average search frequency to the number of queries for the candidate string.
Specifically, the statistical features represent statistical information about the candidate string and include at least tightness, joint probability, conditional probability, reverse conditional probability, pointwise mutual information, second-order pointwise mutual information, log-likelihood ratio and normalized expectation. Likewise, in one embodiment of the application, statistical features such as the joint probability, pointwise mutual information and significance are computed between two consecutive candidate segments of the candidate string, and each specified feature value can be obtained from its calculation formula. If the candidate string contains more than two candidate segments, the specified feature value is first computed for each pair of adjacent segments, and the arithmetic mean, geometric mean, maximum and minimum of these pairwise values can then be taken as the specified feature values corresponding to the candidate string.
Specifically, the association coefficient features represent the association between the candidate segments of the candidate string and are computed in a similar way to the statistical features; they include the probability ratio, increment, Jaccard distance, Simpson distance, Kosgen measure, Piatetsky-Shapiro measure and so on. For example, in one embodiment of the application, when the Jaccard distance and the odds ratio (OR) are calculated, let x and y denote the two adjacent candidate segments of a candidate string; xy denotes that both segments occur together, x̄y that y occurs without x, xȳ that x occurs without y, and x̄ȳ that neither occurs. The counts of these four cases are computed separately and then substituted into the calculation formula of the Jaccard distance or the odds ratio. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
Specifically, the context features represent the relationship between the candidate string and its context and include at least the context entropy, context diversity, left information entropy, right information entropy, left-adjacency diversity, right-adjacency diversity, and the cosine distance between the TF-IDF vectors of the candidate string and of its context. Here, the context entropy represents the amount of information contained in the variation of the candidate string's context in queries; the context diversity represents how many distinct words appear in the candidate string's context in queries; and the left-adjacency diversity represents how many distinct words appear immediately to the left of the candidate string in queries. The corresponding specified feature values can be obtained from the calculation formula of each specified feature value.
Specifically, suppose that in one embodiment of this solution only a few feature values from one of the basic, statistical, association coefficient or context categories are calculated, without considering the other three categories. Target words can still be identified, but the accuracy and recall are far lower than when several feature values of at least two of these categories are selected.
It can be understood that several feature values from the basic, statistical, association coefficient and context features can be chosen according to the actual situation. In an exemplary embodiment of this solution, ten specified feature values were used, including tightness, candidate string segmentation pattern, joint probability, maximum candidate segment length, context diversity, significance, number of candidate string queries, left-adjacency diversity, Jaccard distance and odds ratio, and gave good decision results. It can be understood that in other embodiments of this solution, one or more of these ten specified feature values may also be selected and still give good decision results. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
Further, referring to Fig. 3, the decision value calculation module 13 is configured to input the specified feature values into a preset classifier to obtain the decision value of the candidate string corresponding to the specified feature values.
It can be understood that, to improve the accuracy and recall of the recognition of target words described in this solution, the decision value calculation module 13 in one embodiment of this solution uses a non-linear classifier rather than a simple linear classifier with a directly set threshold. For example, the decision value calculation module 13 may use a gradient boosting decision tree (GBDT) classification algorithm or a deep neural network (DNN) classification algorithm; a deep neural network of course requires a sufficient number of training sample words, and which non-linear classification algorithm to use can be chosen according to the actual situation. In an exemplary embodiment of the present invention, the decision value calculation module 13 realizes this embodiment with the GBDT classification algorithm, which is built from a number of decision trees.
Further, it can be understood that the parameters of the GBDT must first be tuned. Specifically, one embodiment of this solution also includes a classifier setting module configured, before the decision value calculation module 13 inputs the at least one specified feature value into the preset classifier to obtain the decision value of the candidate string corresponding to the specified feature values, to adjust the parameters of the preset classifier according to the decision values obtained after the at least one specified feature value of the sample words in a training sample word set is input into the preset classifier, where the training sample word set is a set of already-identified sample words.
Specifically, the parameters of the classifier include the minimum number of samples required at a leaf node of a decision tree, the maximum tree depth, the number of decision trees, the number of specified feature values used for classification, and the preset threshold range for deciding whether a candidate string is a target word. To avoid overfitting, the minimum number of samples must not be too small, the maximum depth should not be too deep, and the number of specified feature values used for classification should not be too large; of course, too large a minimum sample count may instead cause underfitting. In an example embodiment of this solution, the classifier setting module preferably sets the number of specified feature values used for classification to the square root of the total number of specified feature values; in practice, different values should be tried, up to at most 30%-40% of the total number of specified feature values, which avoids overfitting.
Specifically, the classifier setting module compares the decision value obtained after the at least one specified feature value of a sample word is input into the preset classifier with the known recognition result of that sample word; if the decision value differs from the recognition result of the sample word, it adjusts the parameters of the classifier and repeats the above process until the output of the preset classifier is consistent with the recognition results.
Further, after the classifier setting module has obtained the optimal parameters of the GBDT classifier, the decision value calculation module 13 inputs the at least one specified feature value of a candidate string into the preset classifier and calculates the decision value of the candidate string corresponding to the specified feature values.
Further, referring to Fig. 3, the selection module 14 is configured to set the candidate string whose decision value meets the preset condition as a target word.
Specifically, in one embodiment of the present invention, the preset condition may be that the decision value falls within a preset threshold range. It can be understood that when different specified feature values corresponding to different feature categories are selected and input into the preset classifier, the resulting decision values differ, and the preset threshold range corresponding to the decision value differs accordingly; that is, the preset threshold range corresponds to the selected specified feature values. Of course, those skilled in the art should understand that this example is merely illustrative and does not limit the solution of the present invention.
For example, in one application scenario of this solution, the implementation of this embodiment is described using the film title "我/不是/潘金莲" as an example. First, more than ten specified features are extracted to obtain the specified feature values of the candidate string "我/不是/潘金莲", such as the number of queries for the candidate string (logarithm): 17.835, tightness: 0.995, adjacency diversity: 1287, and so on. These specified feature values are passed through a GBDT classifier, giving a decision value of 0.868783 for whether the candidate string is a target word. The minimum threshold obtained by training on a large number of sample words is 0.8; since the classifier score 0.868783 is greater than the classification threshold 0.8, "我不是潘金莲" is judged to be a target word meeting the preset condition.
Similarly, in another application scenario of this solution, the implementation of this embodiment is described using another candidate string, "电影/我/不是/潘金莲" ("the film / I / am not / Pan Jinlian"). First, more than ten specified features are extracted to obtain the corresponding specified feature values of this candidate string, such as the number of queries for the candidate string (logarithm): 11.644, tightness: 0.679, adjacency diversity: 81, and so on. These specified feature values are passed through a GBDT classifier, giving a decision value of 0.008744 for whether the candidate string is a target word. The minimum threshold obtained by training on a large number of sample words is 0.8; since the classifier score 0.008744 is less than the classification threshold 0.8, "电影我不是潘金莲" is judged not to be a target word.
In summary, in a recognition device for target words provided by one embodiment of the present invention, the word segmentation module 11 performs word segmentation on an acquired candidate string based on text data of minimum granularity to obtain a segmentation result; the feature value calculation module 12 calculates at least one specified feature value for each candidate segment in the segmentation result; the decision value calculation module 13 inputs the at least one specified feature value into a preset classifier to obtain the decision value of the candidate string corresponding to the specified feature values; and the selection module 14 sets the candidate string whose decision value meets the preset condition as a target word. By computing at least one feature value of the candidate string and combining it with a classifier, this embodiment identifies target words with markedly higher accuracy and recall than directly setting thresholds manually or using simple linear analysis, greatly reduces the cost of manual screening, and improves the efficiency of target word recognition.
Further, referring to FIG. 5, an embodiment of this solution further provides an identification system of a target word, comprising:

an identification device 100 of a target word, configured to perform word segmentation, based on text data of minimum granularity, on an acquired candidate string to obtain a word segmentation result; calculate at least one specific feature value of each candidate segment in the word segmentation result; input the at least one specific feature value into a preset classifier to obtain the decision value of the candidate string corresponding to the specific feature values; and set the candidate string corresponding to a decision value that meets a preset condition as a target word;

a training sample identification device 200, configured to provide the identification device of the target word with the training sample words required for setting the parameters of the classifier;

a target word collection device 300, configured to receive the target words identified by the identification device of the target word.
For ease of description, only the parts related to the embodiment of the present invention are shown; for specific technical details that are not disclosed, please refer to the method embodiments of the present invention.
It can be appreciated that the modules described above as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments of this solution.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Further, referring to FIG. 6, which shows a structural block diagram of the identification system of the target word according to an embodiment of the present invention, the system is used to implement the target word recognition method described above. For ease of description, only the parts related to the embodiment of the present invention are shown; for specific technical details that are not disclosed, please refer to the method embodiments of the present invention.
The identification system includes a processor 50 and a storage medium 40. The storage medium 40 can be used to store software programs and modules; the processor 50 executes the various functional applications and data processing of the identification system by running the software programs and modules stored in the storage medium 40. The storage medium 40 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, the application programs required for at least one function, and the like, and the data storage area may store data created according to the use of the identification system (such as audio data, a phone book, etc.). In addition, the storage medium 40 may include a high-speed random access storage medium, and may also include a non-volatile storage medium, for example at least one magnetic disk storage component, a flash memory device, or another volatile solid-state storage component.
The processor 50 is the control center of the system. It connects the various parts of the entire identification system through various interfaces and lines, and performs the various functions of the identification system and processes data by running or executing the software programs and/or modules stored in the storage medium 40 and invoking the data stored in the storage medium 40, thereby monitoring the identification system as a whole. Optionally, the processor 50 may include one or more processing units. Preferably, the processor 50 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 50.
It can be appreciated that, although not shown, the identification system further includes a power supply that powers all of its components. Preferably, the power supply may be logically connected to the processor 50 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The identification system may also include components such as a radio frequency (RF) circuit, an input unit, a display unit, sensors, an audio circuit, and a wireless module.
In one embodiment of this solution, a storage medium 40 is provided. The storage medium 40 includes a stored program, and when the program runs, the target word identification system where the storage medium 40 is located is controlled to perform the steps of any of the foregoing target word recognition methods, so as to realize the following functions:
performing word segmentation, based on text data of minimum granularity, on an acquired candidate string to obtain a word segmentation result;

calculating at least one specific feature value of each candidate segment in the word segmentation result;

inputting the at least one specific feature value into a preset classifier to obtain the decision value of the candidate string corresponding to the specific feature values;

setting the candidate string corresponding to a decision value that meets a preset condition as a target word.
Specifically, the step of performing word segmentation, based on text data of minimum granularity, on the acquired candidate string to obtain a word segmentation result comprises:

performing word segmentation, based on text data of minimum granularity, on the acquired candidate string to obtain candidate segments;

combining the at least one candidate segment to obtain the word segmentation result corresponding to the candidate string.
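To make this two-step segmentation concrete, here is a small illustrative sketch of my own. The `min_granularity_segment` stand-in and the rule of combining all contiguous runs of adjacent segments are assumptions; the patent does not prescribe a specific segmenter or combination rule in this passage.

```python
from typing import List

def min_granularity_segment(candidate: str) -> List[str]:
    # Hypothetical stand-in for a real minimum-granularity segmenter; here we assume
    # segments are delimited by "/" as in the example "I/am not/Lady Pan Jinlian".
    return [seg for seg in candidate.split("/") if seg]

def combine_segments(segments: List[str]) -> List[List[str]]:
    """Combine adjacent minimum-granularity segments: every contiguous run of one
    or more segments becomes one entry of the word segmentation result."""
    combos = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments) + 1):
            combos.append(segments[i:j])
    return combos

segments = min_granularity_segment("I/am not/Lady Pan Jinlian")
for combo in combine_segments(segments):
    print(" + ".join(combo))
```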
Further, the specific feature value includes at least one of the following:

a basic feature, a statistical feature, an association coefficient feature, and a context feature.
Specifically, the basic feature includes at least one of the following: the query count of the candidate string, the segmentation pattern of the candidate string, the maximum length of the candidate segments, the average search frequency of the candidate segments, and the ratio of the query counts of the candidate string.
Specifically, the statistical feature includes at least one of the following: tightness, joint probability, conditional probability, reverse conditional probability, pointwise mutual information, second-order pointwise mutual information, log-likelihood ratio, and normalized expectation.
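As an illustration of a few of the statistical features named above, the sketch below computes joint probability, conditional probability, reverse conditional probability, and pointwise mutual information for a two-segment candidate from raw corpus counts. The counts are invented, and these are only standard textbook formulations; the exact formulas used for features such as tightness are not given in this passage.

```python
import math

# Hypothetical corpus statistics for a two-segment candidate "A B".
N = 10_000_000      # total number of observed segment bigrams (assumed)
count_a = 52_000    # occurrences of segment A
count_b = 38_000    # occurrences of segment B
count_ab = 21_000   # occurrences of A immediately followed by B

p_a, p_b = count_a / N, count_b / N
p_ab = count_ab / N                   # joint probability P(A, B)
p_b_given_a = count_ab / count_a      # conditional probability P(B | A)
p_a_given_b = count_ab / count_b      # reverse conditional probability P(A | B)
pmi = math.log2(p_ab / (p_a * p_b))   # pointwise mutual information

print(f"P(A,B)={p_ab:.6f}  P(B|A)={p_b_given_a:.3f}  P(A|B)={p_a_given_b:.3f}  PMI={pmi:.2f}")
```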
Specifically, the association coefficient feature includes at least one of the following: probability ratio, increment, Jaccard distance, and Simpson distance.
Specifically, the context feature includes at least one of the following: context entropy, context diversity, left information entropy, right information entropy, left adjacent diversity, and right adjacent diversity.
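The left information entropy and left adjacent diversity features can be illustrated as follows; the neighbor counts are invented, and this is only a common formulation (Shannon entropy over the distribution of neighboring words, diversity as the number of distinct neighbors), not necessarily the exact definition used in this solution. The right-side features would be computed the same way over right neighbors.

```python
import math
from collections import Counter

# Hypothetical counts of the words observed immediately to the left of a candidate string.
left_neighbors = Counter({"watch": 120, "movie": 95, "the": 60, "download": 40, "review": 25})

def information_entropy(neighbor_counts: Counter) -> float:
    """Shannon entropy of the neighbor distribution (higher = more varied context)."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbor_counts.values())

left_entropy = information_entropy(left_neighbors)  # left information entropy
left_adjacent_diversity = len(left_neighbors)       # number of distinct left neighbors

print(f"left entropy = {left_entropy:.3f}, left adjacent diversity = {left_adjacent_diversity}")
```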
Optionally, the preset classifier includes a gradient boosted decision tree (GBDT) classifier.
The decision value meeting the preset condition comprises: a decision value falling within a predetermined threshold range, where the predetermined threshold range corresponds to the at least one specific feature value.
Further, the recognition method also includes obtaining the candidate string in a targeted manner through a search engine based on a preset extraction rule.
Further, the recognition method also includes performing data purification preprocessing on the candidate string.
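The purification rules are not spelled out in this passage, so the following is only a hypothetical example of what such preprocessing might look like: trimming whitespace, dropping empty or purely punctuation/numeric strings, and removing duplicate candidate strings.

```python
import re

def purify_candidates(candidates):
    """Hypothetical data-purification pass over raw candidate strings."""
    seen, cleaned = set(), []
    for cand in candidates:
        cand = cand.strip()
        if not cand:
            continue                          # drop empty strings
        if re.fullmatch(r"[\W\d_]+", cand):
            continue                          # drop strings with no word characters
        if cand in seen:
            continue                          # drop duplicates
        seen.add(cand)
        cleaned.append(cand)
    return cleaned

print(purify_candidates(["  I am not Lady Pan Jinlian ", "!!!", "123", "I am not Lady Pan Jinlian"]))
```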
In the specification provided here, a large number of specific details are described. It is to be appreciated, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Although some exemplary embodiments of the invention have been illustrated above, those skilled in the art will understand that changes can be made to these exemplary embodiments without departing from the principle or spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (14)

1. A recognition method of a target word, characterized by comprising:

performing word segmentation, based on text data of minimum granularity, on an acquired candidate string to obtain a word segmentation result;

calculating at least one specific feature value of each candidate segment in the word segmentation result;

inputting the at least one specific feature value into a preset classifier to obtain a decision value of the candidate string corresponding to the specific feature values;

setting the candidate string corresponding to a decision value that meets a preset condition as a target word.
2. The method according to claim 1, characterized in that the step of performing word segmentation, based on text data of minimum granularity, on the acquired candidate string to obtain a word segmentation result comprises:

performing word segmentation, based on text data of minimum granularity, on the acquired candidate string to obtain candidate segments;

combining the at least one candidate segment to obtain the word segmentation result corresponding to the candidate string.
3. The method according to claim 1, characterized in that the specific feature value includes at least one of the following:

a basic feature, a statistical feature, an association coefficient feature, and a context feature.

4. The method according to claim 3, characterized in that the basic feature includes at least one of the following: the query count of the candidate string, the segmentation pattern of the candidate string, the maximum length of the candidate segments, the average search frequency of the candidate segments, and the ratio of the query counts of the candidate string.

5. The method according to claim 3, characterized in that the statistical feature includes at least one of the following: tightness, joint probability, conditional probability, reverse conditional probability, pointwise mutual information, second-order pointwise mutual information, log-likelihood ratio, and normalized expectation.

6. The method according to claim 3, characterized in that the association coefficient feature includes at least one of the following: probability ratio, increment, Jaccard distance, and Simpson distance.

7. The method according to claim 3, characterized in that the context feature includes at least one of the following: context entropy, context diversity, left information entropy, right information entropy, left adjacent diversity, and right adjacent diversity.

8. The method according to claim 1, characterized in that the preset classifier includes a gradient boosted decision tree classifier.

9. The method according to claim 1, characterized in that the decision value meeting the preset condition comprises: a decision value falling within a predetermined threshold range, the predetermined threshold range corresponding to the at least one specific feature value.
10. The method according to claim 1, characterized by further comprising:

obtaining the candidate string in a targeted manner through a search engine based on a preset extraction rule.

11. The method according to claim 10, characterized in that, after the step of obtaining the candidate string in a targeted manner through a search engine based on a preset extraction rule, the method further comprises:

performing data purification preprocessing on the candidate string.
12. A recognition device of a target word, characterized by comprising:

a word segmentation module, configured to perform word segmentation, based on text data of minimum granularity, on an acquired candidate string to obtain a word segmentation result;

a feature value calculation module, configured to calculate at least one specific feature value of each candidate segment in the word segmentation result;

a decision value calculation module, configured to input the at least one specific feature value into a preset classifier to obtain a decision value of the candidate string corresponding to the specific feature values;

a selection module, configured to set the candidate string corresponding to a decision value that meets a preset condition as a target word.
13. A recognition system of a target word, characterized by comprising:

a recognition device of a target word, configured to perform word segmentation, based on text data of minimum granularity, on an acquired candidate string to obtain a word segmentation result; calculate at least one specific feature value of each candidate segment in the word segmentation result; input the at least one specific feature value into a preset classifier to obtain a decision value of the candidate string corresponding to the specific feature values; and set the candidate string corresponding to a decision value that meets a preset condition as a target word;

a training sample identification device, configured to provide the recognition device of the target word with the training sample words required for setting the parameters of the classifier;

a target word collection device, configured to receive the target words identified by the recognition device of the target word.
14. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the system where the storage medium is located is controlled to perform the steps of the method according to any one of claims 1 to 11.
CN201710538781.3A 2017-07-04 2017-07-04 Recognition methods, device, system and the storage medium of target word Pending CN109241392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710538781.3A CN109241392A (en) 2017-07-04 2017-07-04 Recognition methods, device, system and the storage medium of target word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710538781.3A CN109241392A (en) 2017-07-04 2017-07-04 Recognition methods, device, system and the storage medium of target word

Publications (1)

Publication Number Publication Date
CN109241392A true CN109241392A (en) 2019-01-18

Family

ID=65083219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710538781.3A Pending CN109241392A (en) 2017-07-04 2017-07-04 Recognition methods, device, system and the storage medium of target word

Country Status (1)

Country Link
CN (1) CN109241392A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114298A (en) * 2007-08-31 2008-01-30 北京搜狗科技发展有限公司 Method for gaining oral vocabulary entry, device and input method system thereof
CN102411563A (en) * 2010-09-26 2012-04-11 阿里巴巴集团控股有限公司 Method, device and system for identifying target words
CN102043843A (en) * 2010-12-08 2011-05-04 百度在线网络技术(北京)有限公司 Method and obtaining device for obtaining target entry based on target application
CN103092838A (en) * 2011-10-28 2013-05-08 腾讯科技(深圳)有限公司 Method and device for obtaining English words
CN103593427A (en) * 2013-11-07 2014-02-19 清华大学 New word searching method and system
CN105095196A (en) * 2015-07-24 2015-11-25 北京京东尚科信息技术有限公司 Method and device for finding new word in text
CN105183923A (en) * 2015-10-27 2015-12-23 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN106445915A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 New word discovery method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙晓, 孙重远, 任福继: "New Word Discovery and Sentiment Orientation Determination Based on a Deep Structure Model" ("基于深层结构模型的新词发现与情感倾向判定"), 《计算机科学》 (Computer Science) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275938A (en) * 2019-05-29 2019-09-24 广州伟宏智能科技有限公司 Knowledge extraction method and system based on non-structured document
CN110275938B (en) * 2019-05-29 2021-09-17 广州伟宏智能科技有限公司 Knowledge extraction method and system based on unstructured document
CN110175109A (en) * 2019-05-31 2019-08-27 北京北信源软件股份有限公司 A kind of determination method, determining device, equipment and the medium of user type
CN110175109B (en) * 2019-05-31 2023-05-26 北京北信源软件股份有限公司 User type determining method, determining device, equipment and medium
CN110909725A (en) * 2019-10-18 2020-03-24 平安科技(深圳)有限公司 Method, device and equipment for recognizing text and storage medium
CN110909725B (en) * 2019-10-18 2023-09-19 平安科技(深圳)有限公司 Method, device, equipment and storage medium for recognizing text
CN113342936A (en) * 2021-06-08 2021-09-03 北京明略软件***有限公司 Word formation compactness determining method and device, electronic equipment and storage medium
CN113342936B (en) * 2021-06-08 2024-03-22 北京明略软件***有限公司 Word compactness determining method and device, electronic equipment and storage medium
CN114386407A (en) * 2021-12-23 2022-04-22 北京金堤科技有限公司 Word segmentation method and device for text
CN114363218A (en) * 2022-01-07 2022-04-15 合肥工业大学 Communication reachable rate detection method based on end-to-end learning

Similar Documents

Publication Publication Date Title
CN109241392A (en) Recognition methods, device, system and the storage medium of target word
CN110020122B (en) Video recommendation method, system and computer readable storage medium
CN106874292B (en) Topic processing method and device
CN109471944B (en) Training method and device of text classification model and readable storage medium
CN108197144B (en) Hot topic discovery method based on BTM and Single-pass
CN108021708B (en) Content recommendation method and device and computer readable storage medium
CN105512676A (en) Food recognition method at intelligent terminal
CN107209860A (en) Optimize multiclass image classification using blocking characteristic
US20180046721A1 (en) Systems and Methods for Automatic Customization of Content Filtering
CN107545038B (en) Text classification method and equipment
CN113722438B (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN111914159B (en) Information recommendation method and terminal
CN104537341A (en) Human face picture information obtaining method and device
CN103353881A (en) Method and device for searching application
CN106294473B (en) Entity word mining method, information recommendation method and device
CN108319672A (en) Mobile terminal malicious information filtering method and system based on cloud computing
CN113821657A (en) Artificial intelligence-based image processing model training method and image processing method
CN112348188B (en) Model generation method and device, electronic device and storage medium
CN105512300A (en) Information filtering method and system
CN108875050B (en) Text-oriented digital evidence-obtaining analysis method and device and computer readable medium
CN112035685B (en) Album video generating method, electronic device and storage medium
CN108287850A (en) The optimization method and device of textual classification model
CN111324725B (en) Topic acquisition method, terminal and computer readable storage medium
CN104991920A (en) Label generation method and apparatus
Harakawa et al. Extraction of hierarchical structure of Web communities including salient keyword estimation for Web video retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190118