CN107391486A - Domain new-word recognition method based on statistical information and sequence labeling - Google Patents

Domain new-word recognition method based on statistical information and sequence labeling

Info

Publication number
CN107391486A
CN107391486A (application CN201710594672.3A; granted as CN107391486B)
Authority
CN
China
Prior art keywords
word
candidate
character string
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710594672.3A
Other languages
Chinese (zh)
Other versions
CN107391486B (en)
Inventor
李辰刚 (Li Chengang)
王清琛 (Wang Qingchen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Cloud Network Technology Co Ltd
Original Assignee
Nanjing Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Cloud Network Technology Co Ltd
Priority to CN201710594672.3A
Publication of CN107391486A
Application granted
Publication of CN107391486B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a domain new-word recognition method based on statistical information and sequence labeling. The method identifies possible new words with two independent techniques, text statistics and CRF-based character sequence labeling, and filters the results against a background corpus. The outputs of the two recognizers are combined into a set of seed new words, which is then substituted back into the text to eliminate overlapping candidate new words, and the optimal domain new-word list is finally selected. The method avoids the error of recognizing several candidate new words at the same position in the original text. By introducing a background corpus, it also avoids the common failure of purely statistical methods, which mistake frequent ordinary word combinations for domain new words. In addition, by combining the two new-word discovery methods and reducing the influence of raw frequency, it recognizes low-frequency domain new words more accurately than existing methods. The invention therefore improves both the accuracy of low-frequency domain new-word recognition and, to a large extent, the overall precision of domain new-word recognition.

Description

Domain new-word recognition method based on statistical information and sequence labeling
Technical field
The present invention relates to the field of information retrieval and query, and in particular to a domain new-word recognition method based on statistical information and sequence labeling.
Background art
Unlike Western languages such as English, written Chinese has no explicit separators between words, yet Chinese is understood semantically in units of words. Accurately identifying Chinese words is therefore an important step in Chinese natural language processing. For a computer, the words of Chinese are typically defined by a dictionary plus a small number of word-formation rules. However, on one hand, new words keep appearing as society develops and changes; on the other hand, natural language processing is constantly being applied to specialized domains that contain large numbers of domain terms. New words that are absent from existing dictionaries pose a challenge to natural language processing.
Domain new words are words specific to the texts of a particular domain that are not included in general-purpose dictionaries. Domain new-word recognition has wide applications in natural language processing and plays an important role in improving the precision of information retrieval, information extraction, ontology construction, and text classification and clustering on domain texts.
At present, there are two main approaches to identifying new terms from batches of text:
(1) Domain new-word recognition based on statistical information. These methods identify domain new words mainly from the high internal association of their components, generally using techniques from statistics and information theory. The usual workflow is to build statistical information about the text, screen candidate new-word strings according to the statistics, obtain a batch of words, and then verify them manually. Common statistical measures include the chi-square test, the t-test and the log-likelihood ratio from hypothesis testing, and pointwise mutual information and conditional entropy from information theory. Statistics-based methods do not depend on external resources, are not limited to a particular domain, and generalize well. However, they usually take the frequency of a candidate as an explicit or implicit criterion, so they require a fairly large corpus in which the candidate domain new words appear frequently, and their performance on low-frequency domain new words is unsatisfactory. Moreover, because they lack verification against the local context, they may identify several overlapping or nested domain new words at the same position in the text, which makes it difficult to balance precision and recall.
(2) Domain new-word recognition based on supervised machine learning. These methods use a small amount of manually or semi-automatically annotated training corpus and represent domain terms by features of the character distribution. A machine learning model learns these features and is then used to predict new words in domain text. Models currently used for domain new-word recognition include maximum entropy models, support vector machines, hidden Markov models, maximum entropy Markov models and conditional random fields. Their recognition accuracy is high and does not depend on word frequency, but they require users to participate in annotating training data, which involves considerable human effort. As a result, the annotated corpora and experiments are usually small, which limits the practicality of these methods.
In view of this, there is an urgent need for a domain new-word recognition method that solves the above problems.
Summary of the invention
The purpose of the present invention is to overcome the problems of the prior art by providing a domain new-word recognition method based on statistical information and sequence labeling.
To achieve the above object, the invention provides a domain new-word recognition method based on statistical information and sequence labeling, comprising the following steps:
1) Count words and their frequencies in a background corpus to obtain a background word-frequency dictionary and a background bigram frequency dictionary; the background corpus is a corpus that has been word-segmented and manually proofread.
2) Split the text to be analyzed, provided by the user, into sentences; then, using the words contained in the background corpus, segment the sentences with a dictionary-based Chinese word segmentation method built on the dictionary of step 1) to obtain multiple segmentation units, post-process the segmentation units, and obtain segmentation result 1.
3) Take the consecutive segmentation-unit strings in segmentation result 1 that satisfy the candidate-string conditions as candidate strings.
4) Compute the cohesion and the usage freedom of each candidate string from step 3).
5) Compute the phrase probability of each candidate string from the background corpus.
6) From the quantities computed in steps 4) and 5), calculate a word-formation score for each candidate string, and take the candidate strings whose score exceeds a preset threshold T1 as candidate new-word set 1; candidate new-word set 1 is a set of candidate new words containing the form, occurrence count and score of each candidate new word.
7) Split the text to be analyzed, provided by the user, into sentences; then, using the words contained in the background corpus, segment the sentences with a character-tagging-based segmentation method to obtain multiple segmentation units, post-process the segmentation units, and obtain segmentation result 2.
8) Screen the segmentation units of segmentation result 2 with the background corpus dictionary of step 1), count the occurrence frequency of the units that do not appear in the background corpus dictionary and do not satisfy the stop-word rules, and obtain candidate new-word set 2; candidate new-word set 2 is a set of candidate new words containing the form and occurrence count of each candidate new word.
9) Take the k highest-scoring candidate new words from candidate new-word set 1 of step 6) and the k most frequent candidate new words from candidate new-word set 2 of step 8), and take their union or intersection as the seed candidate new-word set.
10) Obtain the new-word set to be verified according to whether the seed candidate new-word set of step 9) was built by union or intersection, specifically:
if the seed candidate new-word set of step 9) was obtained by union, take the union of the seed candidate new-word set and candidate new-word set 1 as the new-word set to be verified;
if the seed candidate new-word set of step 9) was obtained by intersection, take candidate new-word set 1 as the new-word set to be verified.
11) Substitute the seed candidate new-word set of step 9) into segmentation result 1, and adjust the occurrence counts of the new words in the new-word set to be verified from step 10).
12) Traverse the new-word set to be verified, retain the new words whose frequency is greater than 1, and obtain the final domain new-word list.
Further, the post-processing in steps 2) and 7) includes:
merging consecutive segmentation units that contain Chinese numerals or time expressions into one segmentation unit;
merging consecutive segmentation units that contain any two or more of English letters, digits, hyphens and underscores into one segmentation unit.
Further, a candidate string in step 3) satisfies all of the following conditions:
3.1) the candidate string is a consecutive segmentation-unit string that starts with a Chinese character after the processing of step 2), or is a segmentation unit merged in step 2) from consecutive units containing any two or more of English letters, digits, hyphens and underscores;
3.2) the candidate string consists of 2 to 4 segmentation units after the processing of step 2), contains at least one segmentation unit of length 1, and contains Chinese characters;
3.3) the candidate string contains no stop words or punctuation marks after the processing of step 2); the stop words include common auxiliary words, prepositions, modal verbs, Chinese numeral/time words and Chinese numeral-classifier compounds.
Further, the cohesion in step 4) is the minimum of the third-order pointwise mutual information over all binary splits of the candidate string. Assume the candidate string S consists of segmentation units s_1 … s_i s_{i+1} … s_n; then the cohesion of S is

MinMI^3(S) = \min_{1 \le i < n} \log \frac{P(S)^3}{P(s_1 \cdots s_i)\, P(s_{i+1} \cdots s_n)}

where MinMI^3(S) is the cohesion of the candidate string S, P(S) is the probability that S appears in the text to be analyzed, (s_1 … s_i, s_{i+1} … s_n) is a split of S, and P(s_1 … s_i) is the probability that the string s_1 … s_i appears in the text to be analyzed.
Further, the usage freedom in step 4) is computed as the normalized adjacent variety. The normalized adjacent variety of a candidate string S is

NAV(S) = \frac{\min(LAV(S), RAV(S))^2}{Count(S)}

where NAV(S) is the normalized adjacent variety of S; LAV(S), the left adjacent variety of S, is the number of distinct characters preceding S plus the number of times S occurs at the beginning of a sentence; RAV(S), the right adjacent variety of S, is the number of distinct characters following S plus the number of times S occurs at the end of a sentence; and Count(S) is the number of occurrences of S.
Further, the phrase probability of a candidate string in step 5) is

P_{phrase}(S) = \frac{P_{BC}(S)}{\prod_{i=1}^{n} P_{BC}(s_i)}

where P_phrase(S) is the phrase probability of the candidate string, the candidate string S consists of segmentation units s_1 s_2 … s_i s_{i+1} … s_n, P_BC(S) is the probability of S in the background corpus, and P_BC(s_i) is the probability of segmentation unit s_i in the background corpus;
the probability of the candidate string S in the background corpus is estimated with an n-gram language model using interpolated probabilities:

P_{BC}(S) = \prod_{i=1}^{l} \left[ \lambda P(s_i \mid s_{i-n+1} \cdots s_{i-1}) + (1-\lambda) P(s_{i-n+1} \cdots s_{i-1}) \right]

where P_BC(S) is the probability of the candidate string S, P(s_i) is the probability of segmentation unit s_i, P(s_{i-n+1} … s_{i-1}) is the probability of the n-1 segmentation units preceding s_i, λ is an interpolation weight with 0 < λ < 1, and l is the length of the candidate string S.
Further, in step 6) the word-formation score of each candidate string is computed with the following formula:

Score(S) = \begin{cases} \alpha\, MinMI^3(S) + (1-\alpha)\, NAV(S) - \beta\, P_{phrase}(S), & NAV(S) \ge 1 \\ 0, & NAV(S) < 1 \end{cases}

where MinMI^3(S) is the cohesion of the candidate string S, NAV(S) is its usage freedom, P_phrase(S) is its phrase probability, and α, β are parameters in the range 0 to 1.
Further, the character-tagging-based CRF segmentation method of step 7) uses the following configuration:
7.1) the state of a character is represented with at least the four tags B, M, E, S, where B marks the first character of a word, M a character in the middle of a word, E the last character of a word, and S a single character that forms a word by itself;
7.2) the feature template contains at least the following feature forms. Let the current character be C0 and the observation window cover the two characters before and after it, so the observed character sequence is C-2 C-1 C0 C1 C2. The feature template is:
C-2, C-1, C0, C1, C2: unigram features of the current character and the two characters before and after it
C-2C-1, C-1C0, C0C1, C1C2: bigram features within the current observation window
C-1C0C1: trigram feature of the current character and its neighbors
C-1C1: feature of the characters to the left and right of the current character
T0: type of the current character
where the type T0 of the current character in the feature template includes: Chinese numeral, Arabic numeral, letter, Chinese character and punctuation mark.
Further, in step 9), the union is used when the amount of text to be analyzed is small, and the intersection is used when the amount of text to be analyzed is large.
Further, adjusting the occurrence counts of the new words in the new-word set to be verified in step 11) specifically includes:
extracting in turn each candidate string s extracted as in step 3) and adjusting the occurrence counts of the new words in the set to be verified as follows:
if the candidate string s extracted by the method of step 3) does not belong to the new-word set to be verified, discard it;
if the candidate string s extracted by the method of step 3) belongs to the new-word set to be verified and overlaps in the sentence with some seed candidate new word w, reduce the frequency of s in the set to be verified by 1;
if the candidate string s extracted by the method of step 3) belongs to the new-word set to be verified and completely contains some seed candidate new word w, reduce the frequency of w in the set to be verified by 1;
where determining whether an overlap or containment relation exists includes: marking the position of each segmentation unit in the current sentence and, when a candidate string is extracted, comparing the start and end positions of the candidate string and the seed candidate new words.
The present invention identifies possible new words with two independent techniques, text statistics and CRF-based character sequence labeling, and filters them against a background corpus; the results of the two recognizers are combined into seed new words; the seed new words are substituted back into the text to eliminate overlapping candidates; and the optimal domain new-word list is finally selected. The method avoids the error of recognizing several candidate new words at the same position in the original text. By introducing a background corpus, it avoids the failure of statistical methods that mistake frequent ordinary word combinations for domain new words. In addition, by combining two new-word discovery methods and reducing the influence of frequency, it recognizes low-frequency domain new words more accurately than existing methods. The invention can therefore largely improve the precision of domain new-word recognition and the accuracy of low-frequency domain new-word recognition.
Brief description of the drawings
Fig. 1 is a flowchart of the domain new-word recognition method based on statistical information and sequence labeling provided by an embodiment of the present invention;
Fig. 2 is a flowchart of adjusting the occurrence counts of the new words in the new-word set to be verified, provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the present invention clearer, the technical solution in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative work fall within the scope of protection of the invention.
The text handled by the new-word recognition system of the present invention concerns a specific domain. In the present invention, the "text to be analyzed" is the domain-related text, provided by the user, from which domain new words are extracted. The full text of the domain can be used directly; when objective conditions do not allow this, part of the text can be used as a sample and only the domain new words in that sample text are found.
Taking a customer service system as an example, the text to be analyzed is the textual part of the records of interactive dialogues between customer service staff and users. The full text is the textual part of all dialogue records in the system, and the sample text can be defined as the text of a sampled subset of the dialogue records.
Referring to Fig. 1, Fig. 1 is a flowchart of the domain new-word recognition method based on statistical information and sequence labeling provided by an embodiment of the present invention.
In step 101, count the words and their frequencies in the background corpus to obtain a background word-frequency dictionary and a background bigram frequency dictionary.
This step is a preparation step whose main operation is counting word frequencies in the background corpus. In a concrete implementation, if a previously computed word-frequency dictionary and bigram dictionary are available, they can be used directly and this step can be skipped.
In addition, the background corpus in this step has been word-segmented and manually proofread, and it belongs to a domain different from that of the text to be processed; none of the words appearing in the background corpus are domain new words.
In this embodiment, the background corpus is the Modern Chinese balanced corpus of the State Language Commission.
Once the background word-frequency dictionary and the background bigram frequency dictionary are available, the method processes the text to be analyzed as follows.
In step 102, split the text to be analyzed into sentences and, using the words contained in the background corpus, apply a dictionary-based Chinese word segmentation method and post-processing to the text to be analyzed to obtain segmentation result 1.
Here "segmentation result 1" is the set of segmentation-unit strings generated by segmenting the text to be analyzed; each sentence of the text becomes one segmentation-unit string after segmentation.
In the embodiments of the present invention, a "segmentation unit" is one of the short character strings obtained from a text string after word segmentation, i.e. a token that the segmenter considers to be a word. To distinguish it from the usual linguistic notion of "word", it is specifically called a segmentation unit in the present invention.
Any dictionary-based method can be used in this step, such as backward maximum matching, maximum probability segmentation or n-gram methods, but the dictionary used must come from the background corpus. In a concrete implementation, Jieba segmentation (maximum probability), ICTCLAS segmentation (n-gram) or MMSeg segmentation (backward maximum matching) can be used, with the dictionary of the background corpus as the segmentation dictionary.
More preferably, in a concrete implementation, the following post-processing operations are performed after segmentation to further improve the quality of the segmentation.
Consecutive segmentation units containing Chinese numerals or time expressions are merged into one segmentation unit and their type is marked; for example, the consecutive units 六/月 are merged into the single unit 六月 (June).
Consecutive segmentation units containing any two or more of English letters, digits, hyphens and underscores are merged into one segmentation unit; for example, the consecutive units i/P/h/o/n/e/6/s are merged into the single unit iPhone6s.
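A minimal Python sketch of this post-processing, assuming a sentence is represented as a list of segmenter output strings; the character classes and the merging of any adjacent alphanumeric units (rather than only units mixing at least two character kinds) are simplifying assumptions of the sketch, not part of the method itself.

```python
import re

# Character classes are assumptions for illustration; real systems may use
# richer lists of Chinese numerals and time-unit characters.
ALNUM_RE = re.compile(r'^[A-Za-z0-9_-]+$')
CN_NUM_TIME_RE = re.compile(r'^[〇零一二三四五六七八九十百千万亿年月日时分秒]+$')

def merge_units(units):
    """Merge runs of segmentation units that together form one alphanumeric
    string (e.g. i/P/h/o/n/e/6/s -> iPhone6s) or one Chinese numeral/time
    expression (e.g. 六/月 -> 六月)."""
    merged = []
    for u in units:
        if merged and (
            (ALNUM_RE.match(u) and ALNUM_RE.match(merged[-1])) or
            (CN_NUM_TIME_RE.match(u) and CN_NUM_TIME_RE.match(merged[-1]))
        ):
            merged[-1] += u
        else:
            merged.append(u)
    return merged

print(merge_units(['i', 'P', 'h', 'o', 'n', 'e', '6', 's']))  # ['iPhone6s']
print(merge_units(['六', '月']))                               # ['六月']
```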
In step 103, choose the consecutive segmentation-unit strings in segmentation result 1 that satisfy the conditions as candidate strings.
In the present invention, a "segmentation fragment" is the sequence of segmentation units produced when a new word, which cannot be recognized correctly during segmentation, is cut up against the dictionary. A fragment of length 1 is called a single-character fragment; several consecutive segmentation units form a segmentation-unit string.
The strings extracted in this step are called candidate strings. They contain the candidate new words; the candidate strings that are not new words are called garbage strings.
A new word is absent from the dictionary of the segmentation tool and therefore turns into segmentation fragments after segmentation. The method assumes that after segmentation a new word does not survive in complete form and produces at least one single-character fragment. For example, the sentence 客户为负控购电用户， (the customer is a load-control electricity-purchase user) is segmented into 客户/为/负/控/购/电/用户/，, producing the single-character fragments 负, 控, 购 and 电.
According to the rules of Chinese and the applicability of the method, a consecutive segmentation-unit string in this embodiment can become a candidate string only if it satisfies all of the following conditions.
Condition 1: the candidate string is a consecutive segmentation-unit string that starts with a Chinese character after the processing of step 102, or is a segmentation unit merged in step 102 from consecutive units containing any two or more of English letters, digits, hyphens and underscores.
Condition 2: the candidate string consists of 2 to 4 segmentation units after the processing of step 102, contains at least one unit of length 1, and contains Chinese characters; besides the single-character fragment it may therefore contain up to 3 further segmentation units.
Condition 3: the candidate string contains no stop words or punctuation marks after the processing of step 102; the stop words include common auxiliary words, prepositions, modal verbs, Chinese numeral/time words and Chinese numeral-classifier compounds.
The method treats stop words and punctuation marks as natural break points, so collection stops whenever a punctuation mark is met; at the same time, the positions at which digits and letters may appear within domain new words are fixed, which is why candidate strings are constrained by conditions 1 and 3.
The stop-word list used in this step can be configured by the user and obtained from an external data source when domain new-word recognition starts. The stop words may include the following categories:
1. Common auxiliary words, including structural auxiliary words (such as 所), aspect auxiliary words (such as 过) and modal auxiliary words.
2. Prepositions, placed before nouns, pronouns or noun phrases to express direction, object, manner, comparison or the passive, such as 从, 往, 朝, 当 (direction, place or time), 对, 跟, 为 (object or purpose), 按, 照 (manner), 和, 同 (comparison), and 被, 叫, 让 (passive).
3. Modal verbs, such as 要, 想, 会, 能, 可以, 应该, 应当, 得 (děi).
4. Negative adverbs, such as 不, 从不, 不用, 非, 没有, 未, 未必, 无从.
5. Time words containing numerals, such as 一月 (January), 二月 (February), 一小时 (one hour), 十分钟 (ten minutes).
6. Other stop words found necessary in practice.
Such as " client/for/negative/control/purchase/electricity/user/,/", wherein " for " is off word, therefore can extract as follows Candidate character string.
" negative control " " negative control purchase " " negative control purchase electricity "
" control purchase " " control purchase electricity " " control purchase electricity user "
" power purchase " " power purchase user "
" electric user "
More preferably, in a concrete implementation, the candidate new words can be constrained further to improve the effect of the method: the segmentation-unit string contains 2 to 6 characters in total, and none of its single-character fragments is a single-character stop word.
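The extraction of candidate strings described above (conditions 2 and 3 plus the extra 2-6 character constraint; condition 1 on the leading character is omitted for brevity) can be sketched as follows; the input representation and the combined stop-word/punctuation set are assumptions. Run on the example sentence of step 103, the sketch reproduces the nine candidate strings listed above.

```python
def extract_candidates(units, stopwords, min_units=2, max_units=4,
                       min_chars=2, max_chars=6):
    """Extract candidate strings from one post-processed sentence: windows of
    2-4 consecutive segmentation units that contain at least one
    single-character unit, no stop word and no punctuation, and 2-6
    characters in total."""
    candidates = []
    n = len(units)
    for i in range(n):
        for j in range(i + min_units, min(i + max_units, n) + 1):
            window = units[i:j]
            if any(u in stopwords for u in window):
                continue                              # condition 3
            if not any(len(u) == 1 for u in window):
                continue                              # condition 2
            s = ''.join(window)
            if min_chars <= len(s) <= max_chars:
                candidates.append(s)
    return candidates

stop = {'为', '，', '。'}                              # punctuation treated as stop marks here
print(extract_candidates(['客户', '为', '负', '控', '购', '电', '用户', '，'], stop))
# ['负控', '负控购', '负控购电', '控购', '控购电', '控购电用户', '购电', '购电用户', '电用户']
```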
In step 104, compute the cohesion and usage freedom of each candidate string.
Cohesion expresses whether the candidate string tends to occur as a whole rather than as a chance combination of its parts; it reflects how stable the combination is. Usage freedom expresses whether the string can combine with different words in different sentences; it reflects how flexibly the string is used.
Cohesion is computed with third-order mutual information. For two segmentation units x and y, the third-order mutual information is

MI^3(x, y) = \log \frac{P(x, y)^3}{P(x)\, P(y)}

where P(x) is the occurrence probability of x, P(y) the occurrence probability of y, and P(x, y) the probability that x and y occur together.
The cohesion of a candidate string S composed of segmentation units s_1 s_2 … s_i s_{i+1} … s_n is the minimum of the third-order mutual information over all binary splits (s_1 s_2 … s_i, s_{i+1} … s_n), 1 ≤ i < n, of the string:

MinMI^3(S) = \min_{1 \le i < n} \log \frac{P(S)^3}{P(s_1 \cdots s_i)\, P(s_{i+1} \cdots s_n)}

For example, the candidate string S' = 负/控/购 contains three segmentation units and has two splits, (负, 控/购) and (负/控, 购). The cohesion MinMI^3(负/控/购) is then the minimum of MI^3(负, 控/购) and MI^3(负/控, 购).
Usage freedom is computed as the normalized adjacent variety. The normalized adjacent variety of a candidate string S is

NAV(S) = \frac{\min(LAV(S), RAV(S))^2}{Count(S)}

where LAV(S), the left adjacent variety of S, is the number of distinct characters preceding S plus the number of times S occurs at the beginning of a sentence; RAV(S), the right adjacent variety of S, is the number of distinct characters following S plus the number of times S occurs at the end of a sentence; and Count(S) is the number of occurrences of S.
Take some of the candidate strings produced from the sentence 客户为负控购电用户。 as an example. In the text to be analyzed, the candidate strings 负/控/购/电, 控/购, 负/控/购 and 控/购/电 each occur 20 times. Only the character 负 appears to the left of 控/购 and only 电 to its right, so NAV(控购) is 0.05. Eleven different fragments appear to the left of 负/控/购 and only 电 to its right, so NAV(负控购) is 0.05. Only 负 appears to the left of 控/购/电 and fifteen different fragments to its right, so NAV(控购电) is 0.05. Eleven different fragments appear to the left of 负/控/购/电 and fifteen to its right, so NAV(负控购电) is 6.05.
Compared with the left and right adjacent-character (or adjacent-word) entropy used by other methods, the adjacent variety judges more accurately which strings can be used independently.
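A sketch of the two statistics, assuming the frequencies of unit strings and of their left/right neighboring characters have already been counted over the text to be analyzed; the `counts` mapping and the precomputed LAV/RAV values are hypothetical inputs used only for illustration.

```python
import math

def min_mi3(units, counts, total):
    """Cohesion MinMI3(S): the minimum over binary splits of
    log(P(S)^3 / (P(left) * P(right))), with probabilities estimated as
    relative frequencies over the text to be analyzed ('counts' maps a tuple
    of units to its frequency, 'total' is the normalizing count)."""
    p = lambda seq: counts[tuple(seq)] / total
    p_s = p(units)
    return min(math.log(p_s ** 3 / (p(units[:i]) * p(units[i:])))
               for i in range(1, len(units)))

def nav(lav, rav, count_s):
    """Normalized adjacent variety NAV(S) = min(LAV, RAV)^2 / Count(S);
    LAV/RAV count the distinct left/right neighboring characters of S, with
    sentence-initial/final occurrences of S added in."""
    return min(lav, rav) ** 2 / count_s

print(nav(11, 15, 20))  # 6.05, matching NAV(负控购电) in the example above
```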
In step 105, compute the phrase probability of each candidate string from the background word-frequency dictionary and the background bigram frequency dictionary obtained in step 101.
The phrase probability of a candidate string reflects whether the current candidate string is likely to occur as an ordinary phrase: the larger the phrase probability, the smaller the probability that the candidate string is a new word.
The phrase probability of a candidate string is computed as

P_{phrase}(S) = \frac{P_{BC}(S)}{\prod_{i=1}^{n} P_{BC}(s_i)}

where P_phrase(S) is the phrase probability of the candidate string, the candidate string S consists of segmentation units s_1 s_2 … s_i s_{i+1} … s_n, P_BC(S) is the probability of S in the background corpus, and P_BC(s_i) is the probability of segmentation unit s_i in the background corpus. This is an empirical formula.
The probability P_BC(S) of the candidate string S in the background corpus can be estimated with an n-gram language model using interpolated probabilities:

P_{BC}(S) = \prod_{i=1}^{l} \left[ \lambda P(s_i \mid s_{i-n+1} \cdots s_{i-1}) + (1-\lambda) P(s_{i-n+1} \cdots s_{i-1}) \right]

where P(s_i) is the probability of segmentation unit s_i, P(s_{i-n+1} … s_{i-1}) is the probability of the n-1 segmentation units preceding s_i, n is the order of the n-gram language model (2 in this method), λ is an interpolation weight with 0 < λ < 1, and l is the length of S. Experiments show that λ = 0.95 gives good results on the background corpus of this embodiment.
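The following sketch estimates P_BC(S) with a λ-interpolated bigram model and divides by the product of background unigram probabilities to obtain P_phrase(S). For simplicity it interpolates with the unigram probability of the current unit (the usual Jelinek-Mercer form) rather than with the history probability written in the formula above, and the `unigram_p` / `bigram_p` dictionaries are assumed to hold probabilities estimated from the background corpus.

```python
def p_bc(units, unigram_p, bigram_p, lam=0.95):
    """Interpolated bigram estimate of the probability of a unit string in the
    background corpus: prod_i [lam * P(s_i | s_{i-1}) + (1 - lam) * P(s_i)]."""
    prob, prev = 1.0, None
    for u in units:
        cond = bigram_p.get((prev, u), 0.0) if prev is not None else unigram_p.get(u, 0.0)
        prob *= lam * cond + (1 - lam) * unigram_p.get(u, 0.0)
        prev = u
    return prob

def phrase_probability(units, unigram_p, bigram_p, lam=0.95):
    """P_phrase(S) = P_BC(S) / prod_i P_BC(s_i); a large value means S behaves
    like an ordinary phrase of known words and is unlikely to be a new word."""
    denom = 1.0
    for u in units:
        denom *= unigram_p.get(u, 1e-12)   # floor to avoid division by zero
    return p_bc(units, unigram_p, bigram_p, lam) / denom
```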
In step 106, combine the quantities computed in steps 104 and 105, calculate the word-formation score of each candidate string, and take the candidate strings whose score exceeds the preset threshold T1 as candidate new-word set 1.
The method computes the score with the following formula:

Score(S) = \begin{cases} \alpha\, MinMI^3(S) + (1-\alpha)\, NAV(S) - \beta\, P_{phrase}(S), & NAV(S) \ge 1 \\ 0, & NAV(S) < 1 \end{cases}

where MinMI^3(S) is the cohesion of the candidate string S, NAV(S) its usage freedom and P_phrase(S) its phrase probability; α and β are empirical parameters in the range 0 to 1, T1 ranges from 2.4 to 4.8, and in a concrete implementation α takes a value between 0.2 and 0.6.
Candidate new-word set 1 is a set of candidate new words; it contains the form, occurrence count and final score of each candidate new word.
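Putting the three statistics together, a sketch of the word-formation score and of the construction of candidate new-word set 1; the default values of α, β and T1 below are merely example values inside the ranges mentioned above, not values prescribed by the method.

```python
def word_score(min_mi3_s, nav_s, p_phrase_s, alpha=0.4, beta=0.5):
    """Score(S) = alpha*MinMI3(S) + (1-alpha)*NAV(S) - beta*P_phrase(S) when
    NAV(S) >= 1; candidates with NAV(S) < 1 score 0 and are dropped."""
    if nav_s < 1:
        return 0.0
    return alpha * min_mi3_s + (1 - alpha) * nav_s - beta * p_phrase_s

def candidate_set_1(stats, t1=3.0):
    """Keep the candidates whose score exceeds the threshold T1; 'stats' maps
    a candidate string to a (MinMI3, NAV, P_phrase, count) tuple."""
    result = {}
    for s, (m, n, p, cnt) in stats.items():
        score = word_score(m, n, p)
        if score > t1:
            result[s] = {'count': cnt, 'score': score}
    return result
```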
In step 107, segment the text to be analyzed with the character-tagging-based segmentation method. After the CRF tags are decoded into a segmentation result, the same post-processing as in step 102 is applied, giving segmentation result 2.
"Segmentation result 2" is the set of segmentation-unit strings generated by segmenting the text to be analyzed in this step; each sentence of the text becomes one segmentation-unit string.
The segmentation method uses a linear-chain conditional random field (linear-chain CRF) model, hereafter the CRF model, which assigns a positional tag to every Chinese character.
The CRF model should use the following parameters:
1. The state of a character is represented with the four tags B, M, E, S: B marks the first character of a word, M a character in the middle of a word, E the last character of a word, and S a single character that forms a word by itself.
2. The feature template contains the following feature forms. Let the current character be C0 and the observation window cover the two characters before and after it, so the observed character sequence is C-2 C-1 C0 C1 C2:
C-2, C-1, C0, C1, C2: unigram features of the current character and the two characters before and after it
C-2C-1, C-1C0, C0C1, C1C2: bigram features within the current observation window
C-1C0C1: trigram feature of the current character and its neighbors
C-1C1: feature of the characters to the left and right of the current character
T0: type of the current character
The type T0 of the current character is defined as one of six kinds: Chinese numeral, Arabic numeral, letter, other Chinese character, punctuation mark and other symbol.
The training corpus of the CRF model can be the background corpus of step 101.
Take the sentence 客户为负控购电用户。 as an example. In this step the character-tagging-based segmentation method generates the features above for the character string, where "B-2" and "B-1" are virtual tags for the beginning of the sentence and "E+1" and "E+2" are virtual tags for its end.
The CRF model assigns one tag to each Chinese character in the string; the sentence 客户为负控购电用户。 is tagged as:
Character: 客 户 为 负 控 购 电 用 户 。
Tag:       B  E  S  B  E  B  E  B  E  S
Decoding the tags yields the segmentation-unit string 客户/为/负控/购电/用户/。.
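Training the linear-chain CRF itself is not shown here (any CRF toolkit trained on the background corpus with the BMES scheme and the feature template above would do); the sketch below only decodes a predicted tag sequence back into segmentation units and reproduces the example result.

```python
def decode_bmes(chars, tags):
    """Turn a BMES tag sequence predicted by the CRF into segmentation units:
    B starts a word, M continues it, E ends it, S is a single-character word."""
    words, buf = [], ''
    for ch, tag in zip(chars, tags):
        if tag == 'S':
            if buf:
                words.append(buf)
                buf = ''
            words.append(ch)
        elif tag == 'B':
            if buf:
                words.append(buf)
            buf = ch
        elif tag == 'M':
            buf += ch
        else:  # 'E'
            words.append(buf + ch)
            buf = ''
    if buf:
        words.append(buf)
    return words

print(decode_bmes(list('客户为负控购电用户。'),
                  ['B', 'E', 'S', 'B', 'E', 'B', 'E', 'B', 'E', 'S']))
# ['客户', '为', '负控', '购电', '用户', '。']
```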
In step 108, screen the segmentation units of segmentation result 2 with the background corpus dictionary: for the units that do not appear in the background corpus dictionary and do not satisfy the stop-word rule, count their occurrences to obtain candidate new-word set 2.
The stop-word rule in this step is: the unit is a time word containing a digit or a Chinese numeral word.
Candidate new-word set 2 is a set of candidate new words containing the form and occurrence count of each candidate new word.
In step 109, take the k highest-scoring candidate new words from candidate new-word set 1 of step 106 and the k most frequent candidate new words from candidate new-word set 2 of step 108, and take their union (or intersection) as the seed candidate new-word set.
The seed candidate new-word set is a result of relatively high quality: its new words can all be considered valid without review. It contains the form and occurrence count of each candidate new word.
The frequency of a candidate string s is defined as the larger of its frequencies in candidate new-word set 1 and candidate new-word set 2.
More preferably, in a concrete implementation, the union gives better results when the amount of text to be analyzed is small, and the intersection gives better results when the amount of text to be analyzed is large. In this step k takes a value between 5 and 20.
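A sketch of the seed-set construction of step 109, assuming candidate new-word set 1 arrives as a word-to-score mapping and candidate new-word set 2 as a word-to-frequency mapping; k and the union/intersection switch follow the guidance above.

```python
def seed_candidates(set1_scores, set2_counts, k=10, use_union=True):
    """Take the k highest-scoring words from candidate set 1 and the k most
    frequent words from candidate set 2, then combine them by union (small
    texts) or intersection (large texts)."""
    top1 = {w for w, _ in sorted(set1_scores.items(), key=lambda kv: -kv[1])[:k]}
    top2 = {w for w, _ in sorted(set2_counts.items(), key=lambda kv: -kv[1])[:k]}
    return top1 | top2 if use_union else top1 & top2
```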
In step 110, obtain the new-word set to be verified according to whether the seed candidate new-word set of step 109 was built by union or intersection, specifically:
if the seed candidate new-word set of step 109 was obtained by union, take the union of the seed candidate new-word set and candidate new-word set 1 as the new-word set to be verified;
if the seed candidate new-word set of step 109 was obtained by intersection, take candidate new-word set 1 as the new-word set to be verified.
In step 111, substitute the seed candidate new-word set of step 109 into segmentation result 1 and adjust the occurrence counts of the new words in the new-word set to be verified from step 110.
In step 112, traverse the new-word set to be verified and retain the new words whose frequency is greater than 1, obtaining the final domain new-word list.
Steps 111 and 112 can be understood with reference to Fig. 2.
In step 111-1, take the next candidate string s extracted in step 103.
In step 111-2, judge whether the candidate string s belongs to the new-word set to be verified; if so, go to step 111-3, otherwise go to step 111-7.
In step 111-3, judge whether the candidate string s overlaps in the sentence with some seed candidate new word w; if so, go to step 111-4, otherwise go to step 111-5.
In step 111-4, reduce the frequency of s in the new-word set to be verified by 1.
In step 111-5, judge whether the candidate string s completely contains some seed candidate new word w in the sentence; if so, go to step 111-6, otherwise go to step 111-7.
In step 111-6, reduce the frequency of w in the new-word set to be verified by 1.
In step 111-7, judge whether all candidate strings of step 103 have been analyzed; if so, go to step 112, otherwise return to step 111-1. The loop continues until all candidate strings of step 103 have been analyzed.
In step 112, retain the new words whose frequency in the new-word set to be verified is greater than 1 as domain new words, obtaining the final domain new-word list.
Determining whether an overlap or containment relation exists includes: marking the position of each segmentation unit in the current sentence and, when a candidate string is extracted, comparing the start and end positions of the candidate string and the seed candidate new words.
Step 111 can also be understood through the following example.
Suppose a sentence is segmented as 我/的/融/证/债/在/长江/证券/长/网/网址/和/手机/软件/，/网上/交易/软件/显示/的/债/金额/不/一致/的/原因, and the candidate strings 长网 and 融证 are in the seed candidate new-word set.
Segmentation result 1 is analyzed, and candidate strings are extracted one at a time with the same method as in step 103.
If a candidate string s is not in the new-word set to be verified, continue with the next candidate string; otherwise analyze s.
For example, the candidate string 证券/长/网 extracted as in step 103 is not in the new-word set to be verified, so it is discarded and the next candidate string is taken.
If a candidate string s overlaps with some candidate new word w in the seed candidate new-word set, reduce the frequency of s by 1.
For example, the candidate string 网/网址 extracted as in step 103 is in the new-word set to be verified but overlaps in the sentence with the position of the seed candidate new word 长/网, so the frequency of 网/网址 in the new-word set to be verified is reduced by 1.
If a candidate string s contains some candidate new word w of the seed candidate new-word set, reduce the frequency of w in the new-word set to be verified by 1.
For example, the candidate string 融/证/债 extracted as in step 103 is in the new-word set to be verified and contains the seed candidate new word 融/证; in this case the frequency of 融/证 in the new-word set to be verified is reduced by 1.
After the current candidate string has been counted, the next candidate string is taken, until the whole text to be analyzed has been traversed.
In a concrete implementation, the position of each candidate string in the current sentence can be marked; when a candidate string is extracted, the start and end positions of the two strings are compared to determine whether an overlap or containment relation exists.
In this example, the seed candidate new words 融证 and 长网 occupy positions 3-4 and 9-10 among the segmentation fragments. When candidate new words are extracted, the indices of the left and right boundary of each candidate string are output: every candidate string whose right boundary index is 3 or 9, or whose left boundary index is 4 or 10, overlaps with a seed string, and every candidate string whose left boundary index is less than 3 and whose right boundary index is greater than 4 contains 融证.
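The boundary comparison just described can be sketched as follows. Spans are assumed to be inclusive (start, end, word) unit positions within one sentence; "overlap" here means a partial overlap, since full containment is handled by the preceding branch, matching steps 111-3 to 111-6.

```python
def adjust_counts(candidate_spans, seed_spans, verify_counts):
    """Frequency adjustment of step 111 within one sentence: spans are
    inclusive (start, end, word) positions; verify_counts maps each word of
    the set to be verified (seed words included) to its current frequency."""
    for c_start, c_end, c_word in candidate_spans:
        if c_word not in verify_counts:
            continue                              # step 111-2: discard this candidate
        for s_start, s_end, s_word in seed_spans:
            if (c_start, c_end) == (s_start, s_end):
                continue                          # the candidate is the seed word itself
            if c_start <= s_start and s_end <= c_end:
                verify_counts[s_word] -= 1        # candidate fully contains the seed word
            elif c_start <= s_end and s_start <= c_end:
                verify_counts[c_word] -= 1        # candidate partially overlaps the seed word
```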
In summary, based on the above embodiments, the present invention excludes the error of recognizing several candidate new words at the same position in the original text. By introducing a background corpus, it avoids the failure of statistical methods that mistake frequent ordinary word combinations for domain new words. In addition, by combining two new-word discovery methods and reducing the influence of frequency, it recognizes low-frequency domain new words more easily than existing methods. The invention can therefore largely improve the precision of domain new-word recognition and the accuracy of low-frequency domain new-word recognition.
The above embodiments further describe the purpose, technical solution and beneficial effects of the present invention in detail. It should be understood that the above are only embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A domain new-word recognition method based on statistical information and sequence labeling, characterized in that it comprises the following steps:
1) counting words and their frequencies in a background corpus to obtain a background word-frequency dictionary and a background bigram frequency dictionary, the background corpus being a corpus that has been word-segmented and manually proofread;
2) splitting the text to be analyzed, provided by the user, into sentences, then, using the words contained in the background corpus, segmenting the sentences with a dictionary-based Chinese word segmentation method built on the dictionary of step 1) to obtain multiple segmentation units, post-processing the segmentation units, and obtaining segmentation result 1;
3) taking the consecutive segmentation-unit strings in segmentation result 1 that satisfy the candidate-string conditions as candidate strings;
4) computing the cohesion and the usage freedom of each candidate string from step 3);
5) computing the phrase probability of each candidate string from the background corpus;
6) calculating a word-formation score for each candidate string from the quantities computed in steps 4) and 5), and taking the candidate strings whose score exceeds a preset threshold T1 as candidate new-word set 1, candidate new-word set 1 being a set of candidate new words containing the form, occurrence count and score of each candidate new word;
7) splitting the text to be analyzed, provided by the user, into sentences, then, using the words contained in the background corpus, segmenting the sentences with a character-tagging-based segmentation method to obtain multiple segmentation units, post-processing the segmentation units, and obtaining segmentation result 2;
8) screening the segmentation units of segmentation result 2 with the background corpus dictionary of step 1), counting the occurrence frequency of the units that do not appear in the background corpus dictionary and do not satisfy the stop-word rules, and obtaining candidate new-word set 2, candidate new-word set 2 being a set of candidate new words containing the form and occurrence count of each candidate new word;
9) taking the k highest-scoring candidate new words from candidate new-word set 1 of step 6) and the k most frequent candidate new words from candidate new-word set 2 of step 8), and taking their union or intersection as the seed candidate new-word set;
10) obtaining the new-word set to be verified according to whether the seed candidate new-word set of step 9) was built by union or intersection, specifically:
if the seed candidate new-word set of step 9) was obtained by union, taking the union of the seed candidate new-word set and candidate new-word set 1 as the new-word set to be verified;
if the seed candidate new-word set of step 9) was obtained by intersection, taking candidate new-word set 1 as the new-word set to be verified;
11) substituting the seed candidate new-word set of step 9) into segmentation result 1 and adjusting the occurrence counts of the new words in the new-word set to be verified from step 10);
12) traversing the new-word set to be verified, retaining the new words whose frequency is greater than 1, and obtaining the final domain new-word list.
2. The domain new-word recognition method based on statistical information and sequence labeling according to claim 1, characterized in that the post-processing in steps 2) and 7) comprises:
merging consecutive segmentation units that contain Chinese numerals or time expressions into one segmentation unit;
merging consecutive segmentation units that contain any two or more of English letters, digits, hyphens and underscores into one segmentation unit.
3. The domain new-word recognition method based on statistical information and sequence labeling according to claim 1, characterized in that a candidate string in step 3) satisfies all of the following conditions:
3.1) the candidate string is a consecutive segmentation-unit string that starts with a Chinese character after the processing of step 2), or is a segmentation unit merged in step 2) from consecutive units containing any two or more of English letters, digits, hyphens and underscores;
3.2) the candidate string consists of 2 to 4 segmentation units after the processing of step 2), contains at least one segmentation unit of length 1, and contains Chinese characters;
3.3) the candidate string contains no stop words or punctuation marks after the processing of step 2), the stop words including common auxiliary words, prepositions, modal verbs, Chinese numeral/time words and Chinese numeral-classifier compounds.
4. The domain new-word recognition method based on statistical information and sequence labeling according to claim 1, characterized in that the cohesion in step 4) is the minimum of the third-order pointwise mutual information over all binary splits of the candidate string; assuming the candidate string S consists of segmentation units s_1 … s_i s_{i+1} … s_n, the cohesion of S is

MinMI^3(S) = \min_{1 \le i < n} \log \frac{P(S)^3}{P(s_1 \cdots s_i)\, P(s_{i+1} \cdots s_n)}

where MinMI^3(S) is the cohesion of the candidate string S, P(S) is the probability that S appears in the text to be analyzed, (s_1 … s_i, s_{i+1} … s_n) is a split of S, and P(s_1 … s_i) is the probability that the string s_1 … s_i appears in the text to be analyzed.
5. a kind of field new word identification method based on statistical information and sequence labelling according to claim 1, its feature It is, is calculated in the step 4) using degrees of freedom using the normalized adjacent number that changes, candidate character string S normalization Adjacent change number calculation be:
<mrow> <mi>N</mi> <mi>A</mi> <mi>V</mi> <mrow> <mo>(</mo> <mi>S</mi> <mo>)</mo> </mrow> <mo>=</mo> <mfrac> <msup> <mrow> <mo>(</mo> <mi>m</mi> <mi>i</mi> <mi>n</mi> <mo>(</mo> <mrow> <mi>L</mi> <mi>A</mi> <mi>V</mi> <mrow> <mo>(</mo> <mi>S</mi> <mo>)</mo> </mrow> <mo>,</mo> <mi>R</mi> <mi>A</mi> <mi>V</mi> <mrow> <mo>(</mo> <mi>S</mi> <mo>)</mo> </mrow> </mrow> <mo>)</mo> <mo>)</mo> </mrow> <mn>2</mn> </msup> <mrow> <mi>C</mi> <mi>o</mi> <mi>u</mi> <mi>n</mi> <mi>t</mi> <mrow> <mo>(</mo> <mi>S</mi> <mo>)</mo> </mrow> </mrow> </mfrac> </mrow>
where NAV(S) is the normalized adjacency variation number of the candidate character string S; LAV(S) is the left adjacency variation number of S, defined as the number of distinct characters immediately preceding S plus the number of times S occurs at the beginning of a sentence; RAV(S) is the right adjacency variation number of S, defined as the number of distinct characters immediately following S plus the number of times S occurs at the end of a sentence; and Count(S) is the number of occurrences of S.
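For illustration, the normalized adjacency variation number (accessor variety) can be computed directly from the sentences of the text to be analyzed; the plain substring search below is a simplification of this sketch, since a real implementation would respect word-segmentation-unit boundaries.

```python
def normalized_av(candidate, sentences):
    """NAV(S) = min(LAV(S), RAV(S))^2 / Count(S); LAV/RAV count the distinct
    left/right neighbouring characters of S plus its occurrences at a
    sentence start/end."""
    left_chars, right_chars = set(), set()
    left_boundary = right_boundary = count = 0
    for sent in sentences:
        start = 0
        while True:
            pos = sent.find(candidate, start)
            if pos == -1:
                break
            count += 1
            if pos == 0:
                left_boundary += 1
            else:
                left_chars.add(sent[pos - 1])
            end = pos + len(candidate)
            if end == len(sent):
                right_boundary += 1
            else:
                right_chars.add(sent[end])
            start = pos + 1
    if count == 0:
        return 0.0
    lav = len(left_chars) + left_boundary
    rav = len(right_chars) + right_boundary
    return min(lav, rav) ** 2 / count


sents = ["云计算平台上线", "使用云计算降低成本", "云计算"]
print(normalized_av("云计算", sents))   # LAV = 3, RAV = 3, Count = 3 -> 3.0
```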
6. The method for identifying new words in a field based on statistical information and sequence labeling according to claim 1, characterized in that, in step 5), the phrase probability of a candidate character string is:
$$P_{\mathrm{phrase}}(S) = \frac{P_{BC}(S)}{\prod_{i=1}^{n} P_{BC}(s_i)}$$
where P_phrase(S) is the phrase probability of the candidate character string, the candidate character string S is composed of word-segmentation units s1s2…sisi+1…sn, P_BC(S) is the probability that S appears in the background corpus, and P_BC(si) is the probability that the word-segmentation unit si appears in the background corpus;
The probability that the candidate character string S appears in the background corpus is estimated with an n-gram language model, using interpolated probability estimation according to the following formula:
$$P_{BC}(S) = \prod_{i=1}^{l} \Bigl[ \lambda\, P(s_i \mid s_{i-n+1} \ldots s_{i-1}) + (1-\lambda)\, P(s_{i-n+1} \ldots s_{i-1}) \Bigr]$$
where P_BC(S) is the probability of the candidate character string S in the background corpus, P(si) is the probability of occurrence of the word-segmentation unit si, P(si-n+1…si-1) is the probability of occurrence of the n-1 word-segmentation units preceding si, λ is a weighting parameter with 0 < λ < 1, and l is the length (number of word-segmentation units) of the candidate character string S.
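For illustration, the phrase-probability test against the background corpus can be sketched with a simple interpolated bigram estimate standing in for the full n-gram background language model; the counts, the back-off to unigram probabilities and the λ value are all illustrative assumptions of this sketch.

```python
from collections import Counter

# Toy background-corpus statistics over segmentation units; a real system would
# use a language model trained on the background corpus.
unigrams = Counter({"云": 200, "计算": 300, "平台": 150})
bigrams = Counter({("云", "计算"): 5})
TOTAL = sum(unigrams.values())
LAMBDA = 0.7    # interpolation weight, 0 < λ < 1 (illustrative value)


def p_unigram(u):
    return unigrams[u] / TOTAL if TOTAL else 0.0


def p_interpolated(u, prev):
    """Interpolated conditional probability of unit u given its predecessor."""
    p_cond = bigrams[(prev, u)] / unigrams[prev] if unigrams[prev] else 0.0
    return LAMBDA * p_cond + (1 - LAMBDA) * p_unigram(u)


def p_background(units):
    """Chain-rule probability of the candidate string under the background model."""
    p = p_unigram(units[0])
    for prev, u in zip(units, units[1:]):
        p *= p_interpolated(u, prev)
    return p


def phrase_probability(units):
    """P_phrase(S) = P_BC(S) / prod_i P_BC(s_i); large values indicate that S is
    already a common combination in general language rather than a domain term."""
    independent = 1.0
    for u in units:
        independent *= p_unigram(u)
    return p_background(units) / independent if independent else 0.0


print(phrase_probability(["云", "计算"]))
```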
7. The method for identifying new words in a field based on statistical information and sequence labeling according to claim 1, characterized in that, in step 6), the word-formation score of each candidate character string is calculated with the following scoring formula:
$$\mathrm{Score}(S) = \begin{cases} \alpha\,\mathrm{MinMI}^3(S) + (1-\alpha)\,\mathrm{NAV}(S) - \beta\,P_{\mathrm{phrase}}(S), & \mathrm{NAV}(S) \ge 1 \\ 0, & \mathrm{NAV}(S) < 1 \end{cases}$$
where MinMI^3(S) is the cohesion degree of the candidate character string S, NAV(S) is its freedom-of-use degree (normalized adjacency variation number), P_phrase(S) is its phrase probability, and α and β are parameters taking values in the range 0 to 1.
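For illustration, the scoring formula combines the three statistics as shown below; the α and β values are arbitrary examples within the stated 0 to 1 range.

```python
def word_score(min_mi3_value, nav_value, phrase_prob, alpha=0.6, beta=0.4):
    """Score(S) = alpha*MinMI3(S) + (1-alpha)*NAV(S) - beta*P_phrase(S)
    when NAV(S) >= 1, and 0 otherwise."""
    if nav_value < 1:
        return 0.0
    return alpha * min_mi3_value + (1 - alpha) * nav_value - beta * phrase_prob


# Candidates are ranked by this score; the top-scoring ones become seed new words.
print(word_score(min_mi3_value=2.5, nav_value=3.0, phrase_prob=0.3))   # 2.58
```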
8. The method for identifying new words in a field based on statistical information and sequence labeling according to claim 1, characterized in that the character-tagging-based CRF word segmentation method of step 7) uses the following combination of parameters:
7.1) the state of a character is represented with at least the four tags B, M, E and S, where B denotes the first character of a word, M a character in the middle of a word, E the last character of a word, and S a single character that forms a word by itself;
7.2) the feature templates include at least the following feature forms: with the current character denoted C0 and an observation window of two characters before and after it, the observed character sequence is C-2C-1C0C1C2, and the feature templates are as listed below:
C-2, C-1, C0, C1, C2: single-character features of the current character and the two characters before and after it
C-2C-1, C-1C0, C0C1, C1C2: two-character features within the current observation window
C-1C0C1: three-character feature of the current character together with the characters before and after it
C-1C1: feature of the characters to the left and right of the current character
T0: the type of the current character
where the type T0 of the current character in the feature templates includes: Chinese numerals, digits, English letters, Chinese characters and punctuation marks.
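For illustration, the character-level feature templates and the character-type feature can be generated as below; the padding symbol and the type heuristics are simplifications of this sketch, not part of the claim.

```python
def char_type(ch):
    """Rough character-type feature T0 (illustrative classification)."""
    if ch in "零一二三四五六七八九十百千万亿":
        return "CN_NUM"
    if ch.isdigit():
        return "DIGIT"
    if ch.isascii() and ch.isalpha():
        return "LETTER"
    if "\u4e00" <= ch <= "\u9fff":
        return "HANZI"
    return "PUNCT"


def crf_features(sentence, i):
    """Features for position i with an observation window of two characters on each side."""
    padded = "##" + sentence + "##"        # '#' pads beyond the sentence boundary
    j = i + 2
    c = {k: padded[j + k] for k in range(-2, 3)}
    return [
        *[f"C{k}={c[k]}" for k in range(-2, 3)],                 # C-2 .. C2
        *[f"C{k}C{k+1}={c[k]}{c[k+1]}" for k in range(-2, 2)],   # C-2C-1 .. C1C2
        f"C-1C0C1={c[-1]}{c[0]}{c[1]}",                          # three-character feature
        f"C-1C1={c[-1]}{c[1]}",                                  # left/right characters
        f"T0={char_type(c[0])}",                                 # character type
    ]


print(crf_features("云计算平台", 2))   # features for the character '算'
```

These features, together with the B/M/E/S tags, would then be passed to a CRF toolkit to train the character-tagging segmenter.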
9. The method for identifying new words in a field based on statistical information and sequence labeling according to claim 1, characterized in that, in step 9), the union of the two result sets is taken when the amount of text to be analyzed is small, and the intersection is taken when the amount of text to be analyzed is large.
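For illustration, merging the results of the statistics-based and CRF-based recognizers is a simple set operation; the character-count threshold below is an arbitrary assumption, since the claim only distinguishes a small text collection from a large one.

```python
def merge_seed_words(stat_words, crf_words, num_chars, threshold=500_000):
    """Union of the two recognizers' results for a small text collection,
    intersection for a large one (the threshold is an illustrative assumption)."""
    if num_chars < threshold:
        return set(stat_words) | set(crf_words)
    return set(stat_words) & set(crf_words)


print(merge_seed_words({"云计算", "微服务"}, {"云计算"}, num_chars=120_000))
```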
10. The method for identifying new words in a field based on statistical information and sequence labeling according to claim 1, characterized in that adjusting the occurrence frequency of new words in the set of new words to be verified in step 11) specifically includes:
extracting in turn each candidate character string s obtained by the method of step 3), and adjusting the occurrence frequency of new words in the set of new words to be verified as follows:
if the candidate character string s extracted by the method of step 3) does not belong to the set of new words to be verified, discard the candidate character string;
if the candidate character string s extracted by the method of step 3) belongs to the set of new words to be verified and overlaps in the sentence with a certain seed candidate new word w, reduce the frequency of s in the set of new words to be verified by 1;
if the candidate character string s extracted by the method of step 3) belongs to the set of new words to be verified and completely contains a certain seed candidate new word w, reduce the frequency of w in the set of new words to be verified by 1;
where determining whether an overlap or containment relation exists includes: marking the position of each word-segmentation unit in the current sentence and, when extracting a candidate character string, comparing the start and end positions of the candidate character string and of the seed candidate new word.
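For illustration, the overlap and containment tests reduce to comparing start and end positions; the half-open character-offset convention below is an assumption of this sketch.

```python
def relation(cand_start, cand_end, seed_start, seed_end):
    """Classify the positional relation between a candidate string and a seed
    candidate new word (offsets are half-open [start, end) character positions)."""
    if (cand_start, cand_end) == (seed_start, seed_end):
        return "identical"
    if cand_start <= seed_start and seed_end <= cand_end:
        return "contains"                  # candidate fully covers the seed word
    if cand_start < seed_end and seed_start < cand_end:
        return "overlaps"                  # partial overlap in the sentence
    return "disjoint"


def adjust_counts(candidate, seed, freq, rel):
    """Apply the claim-10 frequency adjustments to the to-be-verified set."""
    if rel == "overlaps":
        freq[candidate] = freq.get(candidate, 0) - 1
    elif rel == "contains":
        freq[seed] = freq.get(seed, 0) - 1
    return freq


freq = {"云计算平台": 4, "云计算": 7}
rel = relation(0, 5, 0, 3)                 # "云计算平台" fully contains "云计算"
print(rel, adjust_counts("云计算平台", "云计算", freq, rel))
```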
CN201710594672.3A 2017-07-20 2017-07-20 Method for identifying new words in field based on statistical information and sequence labels Active CN107391486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710594672.3A CN107391486B (en) 2017-07-20 2017-07-20 Method for identifying new words in field based on statistical information and sequence labels

Publications (2)

Publication Number Publication Date
CN107391486A true CN107391486A (en) 2017-11-24
CN107391486B CN107391486B (en) 2020-10-27

Family

ID=60337361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710594672.3A Active CN107391486B (en) 2017-07-20 2017-07-20 Method for identifying new words in field based on statistical information and sequence labels

Country Status (1)

Country Link
CN (1) CN107391486B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010027561A1 (en) * 2008-08-27 2010-03-11 Robert Bosch Gmbh System and method for generating natural language phrases from user utterances in dialog systems
CN101950306A (en) * 2010-09-29 2011-01-19 北京新媒传信科技有限公司 Method for filtering character strings in process of discovering new words
CN101950309A (en) * 2010-10-08 2011-01-19 华中师范大学 Subject area-oriented method for recognizing new specialized vocabulary
US20120191745A1 (en) * 2011-01-24 2012-07-26 Yahoo!, Inc. Synthesized Suggestions for Web-Search Queries
US20130024403A1 (en) * 2011-07-22 2013-01-24 International Business Machines Corporation Automatically induced class based shrinkage features for text classification
FR2986882A1 (en) * 2012-02-09 2013-08-16 Mining Essential METHOD FOR IDENTIFYING A SET OF PHRASES OF A DIGITAL DOCUMENT, METHOD FOR GENERATING A DIGITAL DOCUMENT, ASSOCIATED DEVICE
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device
CN103631938A (en) * 2013-12-10 2014-03-12 江苏金智教育信息技术有限公司 Method and device for automatically expanding segmentation dictionary

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHUNYU KIT ET AL: "Unsupervised Segmentation of Chinese Corpus Using Accessor Variety", Proceedings of the First International Joint Conference on Natural Language Processing *
TANG Tao: "Research on Chinese Word Segmentation Techniques for Specific Domains", China Master's Theses Full-text Database, Information Science and Technology Series (Monthly) *
ZHANG Huaping et al.: "Open-Domain New Word Detection for Social Media", Journal of Chinese Information Processing *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111492364B (en) * 2017-12-15 2022-09-23 华为技术有限公司 Data labeling method and device and storage medium
CN111492364A (en) * 2017-12-15 2020-08-04 华为技术有限公司 Data labeling method and device and storage medium
CN108052503A (en) * 2017-12-26 2018-05-18 北京奇艺世纪科技有限公司 The computational methods and device of a kind of confidence level
CN108427668A (en) * 2018-01-23 2018-08-21 山东汇贸电子口岸有限公司 A kind of generation method of Chinese semantic base neologisms
CN108363691A (en) * 2018-02-09 2018-08-03 国网江苏省电力有限公司电力科学研究院 A kind of field term identifying system and method for 95598 work order of electric power
CN108363691B (en) * 2018-02-09 2021-07-20 国网江苏省电力有限公司电力科学研究院 Domain term recognition system and method for power 95598 work order
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN108509425B (en) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN109408801A (en) * 2018-08-28 2019-03-01 昆明理工大学 A kind of Chinese word cutting method based on NB Algorithm
CN109299230A (en) * 2018-09-06 2019-02-01 华泰证券股份有限公司 A kind of customer service public sentiment hot word data digging system and method
CN109543002B (en) * 2018-10-19 2020-12-11 中南民族大学 Method, device and equipment for restoring abbreviated characters and storage medium
CN109543002A (en) * 2018-10-19 2019-03-29 中南民族大学 Write a Chinese character in simplified form restoring method, device, equipment and the storage medium of character
CN109388806A (en) * 2018-10-26 2019-02-26 北京布本智能科技有限公司 A kind of Chinese word cutting method based on deep learning and forgetting algorithm
CN109858010A (en) * 2018-11-26 2019-06-07 平安科技(深圳)有限公司 Field new word identification method, device, computer equipment and storage medium
CN109858010B (en) * 2018-11-26 2023-01-24 平安科技(深圳)有限公司 Method and device for recognizing new words in field, computer equipment and storage medium
CN110457708A (en) * 2019-08-16 2019-11-15 腾讯科技(深圳)有限公司 Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
CN110457708B (en) * 2019-08-16 2023-05-16 腾讯科技(深圳)有限公司 Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN110688835A (en) * 2019-09-03 2020-01-14 重庆邮电大学 Word feature value-based law-specific field word discovery method and device
WO2022062523A1 (en) * 2020-09-22 2022-03-31 腾讯科技(深圳)有限公司 Artificial intelligence-based text mining method, related apparatus, and device
CN112668324A (en) * 2020-12-04 2021-04-16 北京达佳互联信息技术有限公司 Corpus data processing method and device, electronic equipment and storage medium
CN112668324B (en) * 2020-12-04 2023-12-08 北京达佳互联信息技术有限公司 Corpus data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN107391486B (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN107391486A (en) A kind of field new word identification method based on statistical information and sequence labelling
CN109829159B (en) Integrated automatic lexical analysis method and system for ancient Chinese text
Lind et al. Computational communication science| when the journey is as important as the goal: A roadmap to multilingual dictionary construction
Ljubešić et al. Predicting the level of text standardness in user-generated content
CN108280065B (en) Foreign text evaluation method and device
Chang A new approach for automatic Chinese spelling correction
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN105912720B (en) A kind of text data analysis method of emotion involved in computer
Volk et al. Nunc profana tractemus. Detecting code-switching in a large corpus of 16th century letters
Tedla et al. Analyzing word embeddings and improving POS tagger of tigrinya
Rayson et al. Automatic error tagging of spelling mistakes in learner corpora
Sababa et al. A classifier to distinguish between cypriot greek and standard modern greek
Patel et al. Influence of Gujarati STEmmeR in supervised learning of web page categorization
Leidig et al. Automatic detection of anglicisms for the pronunciation dictionary generation: a case study on our German IT corpus.
Muhamad et al. Proposal: A hybrid dictionary modelling approach for malay tweet normalization
Naemi et al. Informal-to-formal word conversion for persian language using natural language processing techniques
Kasmuri et al. Segregation of code-switching sentences using rule-based technique
CN113705223A (en) Personalized English text simplification method taking reader as center
Tongtep et al. Multi-stage automatic NE and pos annotation using pattern-based and statistical-based techniques for thai corpus construction
Ong et al. A Multi-level Morphological and Stochastic Tagalog Stemming Template
Goonawardena et al. Automated spelling checker and grammatical error detection and correction model for sinhala language
Salah et al. Towards the automatic generation of Arabic Lexical Recognition Tests using orthographic and phonological similarity maps
Xu et al. Historical changes in semantic weights of sub-word units
Navoda et al. Automated spelling and grammar checker tool for sinhala
Ogawa et al. Japanese Particle Error Correction employing Classification Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant