CN105488098B

CN105488098B - A new word extraction method based on domain difference

Info

Publication number: CN105488098B
Application number: CN201510711219.7A
Authority: CN
Inventors: 史树敏; 周新宇; 黄河燕; 史胜清
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2015-10-28
Filing date: 2015-10-28
Publication date: 2019-02-05
Anticipated expiration: 2035-10-28
Also published as: CN105488098A

Abstract

The present invention relates to a new word extraction method based on domain difference and belongs to the technical field of natural language processing applications. The method first compares how characters are distributed across different domains to obtain difference character seeds. It then expands the seeds by n-gram to construct a candidate word set, and removes duplicate words from the candidate set according to the size of their domain difference. Finally, for each word in the candidate set, it uses the domain difference value (DV), cohesion degree (CV), and word-formation rate (NWP) as measures, rejects candidates with low domain difference, and outputs the remaining words as new words. Compared with the prior art, the invention uses the difference information between corpus domains to select seed characters and obtains the candidate set by n-gram expansion; it then automatically selects the new words among the candidates using difference information both within the word itself and between domains, thereby significantly improving the number and accuracy of discovered new words.

Description

A new word extraction method based on domain difference

Technical field

The present invention relates to a new word extraction method, and in particular to a new word extraction method based on domain difference. It belongs to the technical field of natural language processing applications.

Background art

Network new words are special expressions or written forms that have emerged and spread along with the internet. They typically originate from popular film, television, and internet terms, or are widely accepted words produced by some social phenomenon. Network new words occur frequently in internet text such as discussion forums and microblogs. Statistics show that more than 1000 new words enter daily life in China every year. According to related research, more than 60% of word segmentation errors come from network new words, so the accuracy of new word recognition directly affects the performance of intelligent information processing systems. For example, in text sentiment analysis, fixed phrase collocations can express sentiment polarity; if a new phrase cannot be recognized correctly, the judged polarity will be distorted. Consider the user comment on a product glossed as "expression very tall and big on". Here "tall and big on" should be treated as a single network new word that as a whole expresses the positive sentiment "high-end, impressive, and classy". However, in almost all current application systems, the tagged sequence produced by word segmentation is "expression/v very/adv tall/adj big/adj on/adv"; that is, the network new word is cut into single characters. This incorrect segmentation loses the positive sentiment and seriously affects subsequent intelligent analysis. Effective recognition of new words is therefore of great significance in natural language processing.

At present, new word extraction methods fall into two classes: rule-based and statistics-based. The main idea of rule-based methods is to start from the word-formation principles of new words and, on that theoretical basis, build a common corpus that helps identify new words; the linguistic characteristics of words are then studied and a dedicated word-formation rule base is built from the natural attributes of words. Rule-based methods recognize new words with relatively high accuracy but require strong linguistic expertise and background knowledge of the relevant domain. Statistics-based methods take two main forms. The first treats new word extraction as an integral part of word segmentation and uses a statistical model to infer the most likely split points, which yields the new words. Classical statistical models include conditional random fields (CRF) and gradient descent training models based on feature frequency information. The second treats new word extraction as a separate task and usually requires part-of-speech (POS) tagging as preprocessing. Because network new words are timely, circulate widely, and change dynamically, purely rule-based methods are often ineffective; purely statistical methods, in turn, suffer from sparse training data and difficulty in extracting effective features. Most researchers currently combine rules and statistics in order to exploit the advantages of both. However, these methods all ignore an informational advantage of the corpus itself: the same word carries different information (connotation) under different domain topics, which shows up as different distributions of that word across domains.

Summary of the invention

For the new words that are constantly generated and used on the internet, the present invention proposes a new word extraction method based on domain difference. The method makes full use of the characteristics of corpora from different domains and, under the existing general evaluation system, effectively improves the accuracy of new word recognition.

The idea of the invention is to obtain difference character seeds by comparing how characters are distributed across domains, expand the seeds by n-gram to construct a candidate word set, and then, for each word in the candidate set, use the domain difference value, cohesion degree, and word-formation rate as measures to extract the new words.

The definitions used in the present invention are as follows:

Definition 1: A domain difference character is a single character that reflects domain difference. Such a character reflects the features of a domain, and its frequency of occurrence differs greatly between corpora of different domains. For example, if the ratio of the frequency f_internet(c) of a character c in the network corpus to its frequency f_news(c) in the news domain exceeds a threshold λ, then c is called a domain difference character. Linguistically, words are formed from single characters, so a single character can represent this difference; the present invention accordingly takes the distribution of characters as its indicator of difference.

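Definition 2: Duplicate words. If words W_A and W_B satisfy the condition that W_B is a substring of W_A (W_B⊂W_A), then W_B and W_A are called duplicate words of each other. For example: the four-character word glossed as "happy, big, general, run" (W_A) and its three-character substring "big, general, run" (W_B).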

Definition 3: The domain difference value DV (Difference Value) measures domain difference. It is calculated from the frequency f_internet(W) with which a word W occurs in the network corpus and the frequency f_news(W) with which W occurs in the news corpus.

Definition 4: The cohesion degree CV (Concrete Value) is a quantitative index of whether a word has been segmented correctly. For example, "cinema" has two cohesion modes: "film" + "institute" and "electricity" + "movie theatre". For any word W=c₁c₂ (where c₁ and c₂ each denote a character or word that makes up W), all possible cohesion modes are enumerated, the corresponding weight of each is calculated, and the minimum is taken as the cohesion degree of the word.

Definition 5: The word-formation rate NWP (New Word Probability) is an index for judging whether a sequence of single characters forms a word. For example, "love to say" and "love to eat" are both made of single characters, but their NWP is very low, which indicates that neither constitutes a word.

The object of the invention is achieved through the following steps:

A new word extraction method based on domain difference, comprising the following steps:

Step 1: Compare the input corpus S₁ of the domain from which new words are to be obtained with a corpus S₂ of another domain to obtain domain difference character seeds;

Preferably, the domain difference character seeds are obtained by the following steps:

(1) Count the frequencies f_s1(c) and f_s2(c) with which each character c occurs in S₁ and S₂ respectively;

(2) Calculate the difference value of each character between S₁ and S₂ by the following formula:

D_{word_seg}(c)=f_s1(c)/(1+f_s2(c))

(3) Set a threshold λ; if the difference value D_{word_seg}(c) of character c exceeds λ, c is taken as a difference character seed.
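The seed selection of Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "frequency" is taken to be a raw occurrence count, the smoothed ratio f_s1(c)/(1+f_s2(c)) is used, and the comparison is "greater than or equal to λ" as in the embodiment. The function names `char_frequencies` and `difference_seeds` are hypothetical.

```python
from collections import Counter

def char_frequencies(corpus):
    """Count how often each non-space character occurs in an iterable of sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ch for ch in sentence if not ch.isspace())
    return counts

def difference_seeds(s1, s2, lam=2.0):
    """Return the characters whose difference value f_s1(c) / (1 + f_s2(c)) reaches lam.

    Assumption: raw counts stand in for frequencies, and ">=" is used as in the embodiment.
    """
    f1, f2 = char_frequencies(s1), char_frequencies(s2)
    # Counter returns 0 for characters that never occur in s2.
    return {c for c, n in f1.items() if n / (1 + f2[c]) >= lam}

# Toy usage with Latin letters standing in for Chinese characters.
seeds = difference_seeds(["aaaa", "ba"], ["bb"])
```

In the toy usage, "a" occurs five times in the first corpus and never in the second, so it becomes a seed; "b" occurs once against two occurrences in the second corpus and is rejected.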

Step 2: Expand the domain difference character seeds to construct the candidate word set Set_Candidate;

Preferably, the expansion uses n-grams, as follows:

(1) In corpus S₁, take n=2, 3, 4, 5 in turn and obtain all corresponding n-gram words. Retain each n-gram word that contains any difference character, count the frequency of the retained n-gram words, and add them to the candidate word set Set_Candidate;

(2) For every candidate word W in Set_Candidate, compare its word frequency f(W) with a preset frequency threshold; if f(W) is below the threshold, delete W from Set_Candidate;
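A minimal sketch of the n-gram expansion of Step 2 follows. It assumes sentences are plain strings, that "word frequency" is a raw count, and that a candidate is kept when its count is at least the threshold; `build_candidates` and `min_freq` are hypothetical names, since the source does not name the threshold.

```python
from collections import Counter

def build_candidates(corpus, seeds, min_freq, n_values=(2, 3, 4, 5)):
    """Collect n-grams from S1 that contain a seed character and occur at least min_freq times."""
    counts = Counter()
    for sentence in corpus:
        for n in n_values:
            # range() is empty when the sentence is shorter than n.
            for i in range(len(sentence) - n + 1):
                gram = sentence[i:i + n]
                if any(ch in seeds for ch in gram):
                    counts[gram] += 1
    # Frequency filter: drop candidates below the preset threshold.
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

# Toy usage: only "ab" occurs twice, so it is the only surviving candidate.
candidates = build_candidates(["xab", "yab"], seeds={"a"}, min_freq=2)
```

Counting and filtering in one pass keeps the sketch short; a larger corpus would likely count per n first so that rare n-grams can be discarded early.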

Step 3: Remove the duplicate words in Set_Candidate according to the size of the candidates' domain difference;

Preferably, the domain difference of a candidate word W is calculated by the following formula:

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))

where f_s1(W) denotes the frequency with which W occurs in corpus S₁ and f_s2(W) the frequency with which W occurs in corpus S₂.
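As a small illustration, the domain difference value can be computed as below. The base of the logarithm is not stated in the text; base 10 is assumed here because it reproduces the embodiment's values (0.845 for counts 6 and 0, 0.602 for counts 3 and 0). The formula is implemented as printed above; the embodiment writes the same quantity as log((6+1)/(0+1)), which coincides with it when f_s2(W)=0. The function name `domain_difference` is hypothetical.

```python
import math

def domain_difference(f_s1, f_s2):
    """DV(W) = log10(1 + f_s1(W) / (1 + f_s2(W))), following the formula as printed.

    Assumption: base-10 logarithm, chosen to match the embodiment's worked values.
    """
    return math.log10(1 + f_s1 / (1 + f_s2))

dv_a = round(domain_difference(6, 0), 3)   # word seen 6 times in S1, never in S2
dv_b = round(domain_difference(3, 0), 3)   # word seen 3 times in S1, never in S2
```

The +1 in the denominator keeps the value defined for words that never occur in the comparison corpus, which is the typical case for new words.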

Further, to obtain a better de-duplication effect, the domain difference of duplicate words can take both the cohesion degree and the domain difference value into account. That is, according to Definition 2, all duplicate words in Set_Candidate are found and compared; within each pair the word with the larger weight is retained and the word with the smaller weight is discarded. The process is repeated until Set_Candidate no longer contains duplicate words. The detailed procedure is as follows:

(1) According to Definition 2, take n=2, 3, 4, 5 and compare all words in Set_Candidate to find all duplicate words, where n denotes the number of characters in a word of Set_Candidate;

(2) According to Definitions 3 and 4, calculate the cohesion degree CV(W) and domain difference value DV(W) of each duplicate word, using the following formulas:

Cohesion degree:

CV (W)=min f(W)/(f(c₁)*f(c₂)), the minimum being taken over all splits W=c₁c₂

Domain difference value:

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))

Further, the duplicate words are compared pairwise by the weighted value V given by the following formula, and the word with the larger weight is kept:

V (W)=αⁿ*DV(W)+CV(W)

where α is a parameter that measures the permitted difference between n-grams of different lengths, n is the number of characters in word W, c_i denotes the i-th character or word in W, and w₁ and w₂ are two words that are duplicates of each other.

(3) Repeat steps (1) and (2) until the candidate word set no longer contains duplicate words.
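The iterative de-duplication of Step 3 can be sketched as follows. The sketch assumes that the weight V(W) of every candidate has already been computed and that "duplicate" means one word is a substring of the other, as in Definition 2. The greedy order in which pairs are resolved is an assumption, since the text only requires pairwise comparison; `is_duplicate_pair` and `remove_duplicates` are hypothetical names.

```python
def is_duplicate_pair(a, b):
    """Two distinct words are duplicates of each other if one is a substring of the other."""
    return a != b and (a in b or b in a)

def remove_duplicates(weights):
    """Repeatedly drop the lower-weight word of a duplicate pair until none remain.

    `weights` maps each candidate word to its weight V(W).
    """
    kept = dict(weights)
    changed = True
    while changed:
        changed = False
        # Visit words from highest to lowest weight so the later word of a pair is the weaker one.
        words = sorted(kept, key=kept.get, reverse=True)
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                if is_duplicate_pair(a, b):
                    del kept[b]
                    changed = True
                    break
            if changed:
                break
    return kept

# Toy usage: "xabc" contains "abc" and has the smaller weight, so it is removed.
survivors = remove_duplicates({"abc": 1.249, "xabc": 1.113, "zz": 0.5})
```

Each pass removes exactly one word, so the loop terminates after at most as many passes as there are candidates.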

Step 4: Remove the candidates in Set_Candidate with low domain difference; add the candidates whose domain difference is above the preset threshold γ to the new word set Y, and output Y to obtain all new words.

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))

Further, the domain difference can be characterized as follows: for each candidate word in Set_Candidate, calculate its domain difference value (DV), cohesion degree (CV), and word-formation rate (NWP) according to Definitions 3, 4, and 5 respectively, and combine them in a given proportion. Specifically:

(1) Calculate the difference value DV(W) of candidate word W by the following formula:

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))

(2) Calculate the word-formation rate NWP(W) of candidate word W by the following formula:

where f(c_i) denotes the frequency of the single character c_i in W, and Single(c_i) denotes the frequency of c_i after a word segmentation tool is applied;

(3) Calculate the cohesion degree CV(W) of candidate word W by the following formula:

CV (W)=min f(W)/(f(c₁)*f(c₂)), the minimum being taken over all splits W=c₁c₂

(4) Normalize the difference value (DV), word-formation rate (NWP), and cohesion degree (CV) separately, using the following normalization formula:

X_norm=(X_j-X_min)/(X_max-X_min)

where X_j is the current value (difference value, word-formation rate, or cohesion degree) of the j-th word, X_min is the minimum of that value over all words, and X_max is its maximum;

(5) Calculate the weight V of candidate word W by the following formula:

V (W)=a*DV (W)+b*CV (W)+c*NWP (W)

where a, b, and c are the proportions that the difference value, cohesion degree, and word-formation rate respectively contribute to the weight V.

Beneficial effects

Compared with the prior art, the present invention uses the difference information between corpus domains to select seed characters and obtains the candidate word set by n-gram expansion; it then uses the difference information both within the word itself and between domains to automatically select the new words among the candidates, thereby significantly improving the number and accuracy of discovered new words.

Brief description of the drawings

Fig. 1 is a flow diagram of the new word extraction method based on domain difference according to an embodiment of the present invention;

Fig. 2 is a schematic comparison of the method of the present invention with four existing new word extraction methods in terms of the number of new words recognized and accuracy.

Detailed description of the embodiments

The method of the present invention is described in further detail below with reference to the drawings and an embodiment.

Embodiment

This embodiment describes the method in detail, taking a network corpus as S₁ and a news corpus as S₂.

The network corpus is a post from a discussion forum, as shown in Table 1:

Table 1:

The news corpus is a news item from April 4, 2001, as shown in Table 2:

Table 2:

The processing flow of the new word extraction method based on domain difference is shown in Fig. 1 and comprises the following steps:

Step 1: Obtain the domain difference character seeds:

A domain difference character is a character that occurs markedly more often in one corpus than in others. There are many ways to obtain domain difference characters; this embodiment simply decides whether a character is a seed by whether the difference in its frequency between the two corpora is above a preset threshold, as follows:

Count the frequency with which each character occurs in the network corpus and in the news corpus respectively, then calculate the difference value of the two. Finally, set the threshold λ to 2 and take the characters whose difference value is greater than or equal to λ as difference characters. The resulting set of difference characters is shown in Table 3:

Table 3:

Step 2: Expand the difference character seeds to obtain the candidate word set.

There are many ways to expand difference characters into candidate words, for example with a dictionary or with n-grams. This embodiment uses n-grams, as follows: in the network corpus, take n=2, 3, 4, or 5 in turn and obtain all n-gram strings; an n-gram word is retained if it contains any difference character, and deleted if it is a meaningless string. For example, from "good beautiful mew star people" the following n-grams can be extracted:

2-gram { " good drift ", " beautiful ", " bright ", " mew ", " mew star ", " star people " },

3-gram { " good beautiful ", " beautiful ", " bright mew ", " mew star ", " mew star people " },

4-gram { "good beautiful ", "beautiful mew ", "bright mew star ", "mew star people " },

5-gram { "good drift bright mew ", "beautiful mew star ", "bright mew star people " }

Then count the word frequency of each of these n-grams and set a frequency threshold. When the word frequency f(W) of a word W exceeds the threshold and W contains any of the above difference characters, W is selected as a candidate word. The resulting candidate word set is shown in Table 4:

Table 4:

Step 3: removal repetitor.

First, according to Definition 2, find all duplicate words in Set_Candidate. Taking "mew star people" as an example, the duplicate pairs found are: {mew star, mew star people}, {star people, mew star people}, {mew star people, mew star people}, {mew star people, the mew star people of love};

Next, for each pair of duplicate words, retain the candidate with the larger domain difference. The domain difference could be characterized simply by the frequencies with which the candidate occurs in the two corpora; to overcome the corpus-dependent effects of a plain frequency difference, this embodiment characterizes it by the logarithm of the ratio of the two, as in the following formula:

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))

Further, experiments show that a better de-duplication effect is obtained if the domain difference considers not only the domain difference value DV of the formula above but also the cohesion degree CV; that is, the domain difference is the weight obtained by combining the two as in the following formula:

V (W)=αⁿ*DV(W)+CV(W)

Therefore, according to Definitions 3 and 4, the cohesion degree and difference value of each of the above words are calculated. Take the duplicate pair {mew star people, the mew star people of love} as an example: the word frequency of "mew star people" is 6, the word frequency of "the mew star people of love" is 3, and both have word frequency 0 in the news corpus. Then:

DV (mew star people)=log ((6+1)/(0+1))=0.845

DV (the mew star people of love)=log ((3+1)/(0+1))=0.602

"mew star people" has two cohesion modes, "mew"+"star people" and "mew star"+"people", whose cohesion values are respectively:

CV (" mew "+" star people ")=6/ (8*6)=0.125

CV (" mew star "+" people ")=6/ (6*7)=0.143.

The smaller value is taken as the cohesion degree of the word "mew star people":

CV (mew star people)=0.125

Similarly, "the mew star people of love" has four cohesion modes: "love"+"of mew star people", "of love"+"mew star people", "mew of love"+"star people", and "the mew star of love"+"people".

The cohesion values are respectively:

CV ("love"+"of mew star people")=3/ (4*4)=0.188

CV ("of love"+"mew star people")=3/ (3*6)=0.167

CV ("mew of love"+"star people")=3/ (3*6)=0.167

CV ("the mew star of love"+"people")=3/ (3*7)=0.143

The smallest value is taken as the cohesion degree of the word "the mew star people of love":

CV (the mew star people of love)=0.143

Take the parameter α=1.1:

V (mew star people)=0.845*1.1³+ 0.125=1.249

V (the mew star people of love)=0.602*1.1⁵+ 0.143=1.113
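The arithmetic of this worked example can be checked with a short sketch. The cohesion formula used here, f(W)/(f(c₁)·f(c₂)) minimized over all two-part splits, is inferred from the worked figures (for example 6/(8*6)) rather than quoted from a printed formula. The Latin letters "A", "B", "C" are placeholders for the three characters of "mew star people", and the names `cohesion` and `dedup_weight` are hypothetical.

```python
def cohesion(word, freq):
    """CV(W): minimum of f(W) / (f(c1) * f(c2)) over all two-part splits W = c1 c2."""
    return min(
        freq[word] / (freq[word[:i]] * freq[word[i:]])
        for i in range(1, len(word))
    )

def dedup_weight(word, dv, freq, alpha=1.1):
    """V(W) = alpha**n * DV(W) + CV(W), with n the number of characters in W."""
    return alpha ** len(word) * dv + cohesion(word, freq)

# Counts from the embodiment: f(mew)=8, f(star people)=6, f(mew star)=6, f(people)=7, f(mew star people)=6.
freq = {"ABC": 6, "A": 8, "BC": 6, "AB": 6, "C": 7}
cv = cohesion("ABC", freq)               # min(6/(8*6), 6/(6*7))
v = dedup_weight("ABC", 0.845, freq)     # 1.1**3 * 0.845 + 0.125
```

Because αⁿ grows with word length, longer duplicates receive a small bonus on their DV term; with α=1.1 that bonus is not enough here to offset the lower DV of the five-character word.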

So in this de-duplication "mew star people" is retained and "the mew star people of love" is deleted. Step 3 is executed for all duplicate words in Set_Candidate until no duplicate words remain. The final candidate words are shown in Table 5:

Table 5:

Step 4: Screen the candidate words by domain difference to obtain the new word set, and output it.

As in Step 3, the domain difference can be characterized by the logarithm of the candidate's frequency ratio between the corpora. Experiments show, however, that a better effect is obtained if the domain difference combines the domain difference value DV, the word-formation rate NWP, and the cohesion degree CV in a given proportion, as in the following formula:

V (W)=a*DV (W)+b*CV (W)+c*NWP (W)

For each candidate word in Set_Candidate, calculate its domain difference value, cohesion degree, and word-formation rate according to Definitions 3, 4, and 5 respectively.

Again taking the word "mew star people" as an example:

Difference value: DV (mew star people)=log ((6+1)/(0+1))=0.845

Cohesion degree: CV (mew star people)=6/ (8*6)=0.125 (the minimum, obtained with "mew"+"star people")

Word-formation rate:

In this embodiment, applying the ICTCLAS word segmentation tool to the above gives single (mew)=8, single (star)=6, single (people)=7; in addition, f (mew)=8, f (star)=6, f (people)=7, and f (mew star people)=6. Therefore

Further, to obtain a better extraction effect, the three values above are first normalized and then combined to obtain the weight of the domain difference;

The maximum and minimum of the three values over the 7 words shown in Table 5 are:

DV_max=0.903；DV_min=0.176；

CV_max=0.25；CV_min=0.071；

NWP_max=1；NWP_min=0；

After normalization, the three values for "mew star people" are DV=0.920, CV=0.302, and NWP=0.

Take a=0.6, b=0.4, c=-0.2；

V_{Mew star people}=0.6*0.920+0.4*0.302-0.2*0=0.6728
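The normalization and weighting for "mew star people" can be reproduced with the sketch below. It assumes min-max normalization, which matches the definitions of X_min and X_max and reproduces the values 0.920 and 0.302 used in the computation above. The normalized NWP of 0 is taken directly from that computation, because the NWP formula itself is not available in the text. The names `min_max` and `final_weight` are hypothetical.

```python
def min_max(x, x_min, x_max):
    """Scale x into [0, 1] using the smallest and largest value over all candidate words."""
    return (x - x_min) / (x_max - x_min)

def final_weight(dv, cv, nwp, a=0.6, b=0.4, c=-0.2):
    """V(W) = a*DV(W) + b*CV(W) + c*NWP(W), applied to normalized inputs."""
    return a * dv + b * cv + c * nwp

dv_n = min_max(0.845, 0.176, 0.903)   # DV with the DV_min and DV_max reported above
cv_n = min_max(0.125, 0.071, 0.25)    # CV with the CV_min and CV_max reported above
nwp_n = 0.0                           # normalized NWP as used in the embodiment's computation
v = final_weight(dv_n, cv_n, nwp_n)
keep = v >= 0.4                       # threshold gamma from the embodiment
```

The negative coefficient c means that a higher normalized NWP lowers the final weight under the embodiment's parameter choice.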

The domain difference obtained in this way for all words shown in Table 5 is given in Table 6:

Table 6:

Taking the threshold γ=0.4 and filtering out all words whose domain difference is below γ gives the new word set {building-owner, mew star people, tinkling of pieces of jade body}.

Experimental result:

To verify the validity of the new word extraction method based on domain difference according to the embodiment of the present invention, this experiment uses as the network corpus the Sina Weibo posts from the three days of June 6-8, 10,237,813 in total, together with 3,524,584 posts from the Baidu forum "the big Supreme Being of Li Yi". As the news corpus it uses all news data published by Xinhua News Agency from 1993 to 2004, 9,517,292 sentences in total. The existing new word extraction methods CV, NWP, EMI, and PNWD are compared with the DV and DV+CV+NWP methods proposed by the present invention in terms of the number of new words recognized and accuracy; the comparison is shown in Fig. 2.

CV and NWP are statistical new word extraction methods well known to those skilled in the art and are not described again here.

EMI: the Enhanced Mutual Information algorithm proposed by Zhang et al. in 2009, with the formula:

where the word W=w₁w₂…w_n, w_i is each character that makes up the word, and n is the number of characters in the word. F denotes the number of occurrences of word W and F_i the number of occurrences of character w_i. The idea of the algorithm is to measure the dependence of a word on each of its characters: the larger the value, the more likely the string is a word.

PNWD: the Pattern-based New Word Detection algorithm proposed by Huang et al. in 2014. Its core idea is to use POS tagging information and a seed vocabulary to automatically select matching phrase patterns such as <ad, *, au>, and then to automatically extract newly appearing words with these patterns.

As shown in Fig. 2, the x-axis represents the top k words and the y-axis the average precision AP(k) of the top k words. The figure shows that DV and DV+CV+NWP perform better than the benchmark methods EMI, CV, and NWP. Compared with the benchmark method PNWD, DV and DV+CV+NWP also perform better, whereas CV and NWP are slightly less accurate than PNWD when the result set is small and improve markedly as the result set grows. This is because PNWD can only discover adjectival new words and ignores new words of other parts of speech; once the adjectival new words have been identified, its recognition rate on other parts of speech declines. DV achieves very good results mainly because it makes full use of the difference between domains, and new words express this domain difference well. CV and NWP have somewhat lower recognition accuracy, mainly because they judge 2-gram words poorly: a 2-gram word splits into two single characters, each of which occurs with high probability, so both values are very low for 2-grams and such words are hard to identify. Since a large share of new words are 2-grams, neither method performs well. DV+CV+NWP combines the advantages of DV, CV, and NWP and obtains the best result. Compared with conventional methods, therefore, the new word extraction method based on domain difference proposed by the present invention achieves higher accuracy and discovers more new words.

The above shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the invention is not limited to the above embodiment; the embodiment and the description only illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope, which is defined by the appended claims and their equivalents.

Claims

1. A new word extraction method based on domain difference, characterized by comprising the following steps:

Step 1: compare the input corpus S₁ of the domain from which new words are to be obtained with a corpus S₂ of another domain to obtain domain difference character seeds; a domain difference character is a single character that reflects domain difference, that is, if the ratio of the frequency f_internet(c) of a character c in the corpus of one domain to its frequency f_news(c) in the corpus of another domain exceeds a threshold λ, then c is called a domain difference character;

Step 2: expand the domain difference character seeds by n-gram to construct the candidate word set Set_Candidate, as follows:

(1) in corpus S₁, take n=2, 3, 4, 5 in turn and obtain all corresponding n-gram words; retain each n-gram word that contains any domain difference character, count the frequency of the retained n-gram words, and add them to the candidate word set Set_Candidate;

(2) for every candidate word W in Set_Candidate, compare its word frequency with a preset frequency threshold; if its word frequency is below the threshold, delete W from Set_Candidate;

Step 3: remove the duplicate words in the candidate word set Set_Candidate according to the size of the candidates' domain difference;

the domain difference of a candidate word W is calculated by the following formula:

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))

where f_s1(W) denotes the frequency with which word W occurs in corpus S₁ and f_s2(W) the frequency with which word W occurs in corpus S₂;

Step 4: remove the candidates in Set_Candidate with low domain difference, add the candidates whose domain difference is above the preset threshold γ to the new word set Y, and output Y to obtain all new words.

2. The new word extraction method based on domain difference according to claim 1, characterized in that the domain difference character seeds are obtained by the following procedure:

D_{word_seg}(c)=f_s1(c)/f_s2(c)

(3) set a threshold λ; if the difference value D_{word_seg}(c) of character c exceeds the difference threshold λ, c is taken as a difference character seed.

3. The new word extraction method based on domain difference according to claim 2, characterized in that λ=2.

4. The new word extraction method based on domain difference according to any one of claims 1 to 3, characterized in that removing the duplicate words in Set_Candidate according to the size of the domain difference is carried out by the following steps:

(1) take n=2, 3, 4, or 5 and compare all words in Set_Candidate to find all duplicate words, where n denotes the number of characters in a word of Set_Candidate;

(2) for the duplicate words found, calculate the weight V by the following formula, taking both the cohesion degree CV and the domain difference value DV into account, and retain the word with the larger weight and remove the word with the smaller weight, thereby achieving de-duplication:

V (W)=αⁿ*DV(W)+CV(W)；

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))；

where α is a parameter that measures the permitted difference between n-grams of different lengths, c_i denotes the i-th character or word in word W, and W=c₁c₂; f(W) denotes the frequency with which word W occurs in the text corpus.

5. The new word extraction method based on domain difference according to claim 4, characterized in that α=1.1.

6. The new word extraction method based on domain difference according to any one of claims 1 to 3, characterized in that the "domain difference" in removing the candidates in Set_Candidate with low domain difference is the value obtained by combining the domain difference value DV, the word-formation rate NWP, and the cohesion degree CV in a given proportion, namely the weight V, which is obtained by the following procedure:

DV (W)=log (1+f_s1(W)/(1+f_s2(W)))

where f(c_i) denotes the frequency of character c_i; Single(c_i) denotes the frequency of c_i after a word segmentation tool is applied; i is the index of the characters that make up W, and n is the number of characters that make up word W;

(4) normalize the difference value (DV), word-formation rate (NWP), and cohesion degree (CV) separately, using the following normalization formula:

where X_j is the current value of the j-th word, the current value being the difference value, word-formation rate, or cohesion degree; X_min is the minimum of that value over all words and X_max its maximum;

(5) calculate the weight V of candidate word W by the following formula:

V (W)=a*DV (W)+b*CV (W)+c*NWP (W)

7. The new word extraction method based on domain difference according to claim 6, characterized in that a=0.6, b=0.4, c=-0.2.

8. The new word extraction method based on domain difference according to any one of claims 1-3, 5, or 7, characterized in that γ=0.4.