CN103294664A

CN103294664A - Method and system for discovering new words in open fields

Info

Publication number: CN103294664A
Application number: CN2013102791845A
Authority: CN
Inventors: 陈飞; 刘奕群; 马少平; 张敏; 金奕江; 张阔
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2013-07-04
Filing date: 2013-07-04
Publication date: 2013-09-11

Abstract

The invention provides a method and a system for discovering new words in open fields. The method includes: receiving a corpus to be processed; performing format conversion and word segmentation on the corpus to obtain a plurality of text messages; extracting characteristic information of the text messages; judging whether combinations of adjacent text messages within a part of the text messages are new words and, if so, labeling the new word boundaries of those adjacent text messages; estimating the parameters of a conditional random field model from the labeled text messages and the characteristic information; and identifying the remaining text messages with the estimated conditional random field model to obtain the new words they contain. By labeling new word boundaries, estimating the parameters of the conditional random field model, and using the model to identify the text messages, new words in various fields can be identified and identification efficiency is improved.

Description

Method and system for open-domain new word discovery

Technical field

The present invention relates to the field of intelligent information processing, and in particular to a method and a system for open-domain new word discovery.

Background technology

Unlike Western languages such as English, Chinese has no fixed separator between words, so word segmentation is usually the first necessary step of a Chinese information processing task. Existing studies show that words encountered during segmentation that are not in the segmentation tool's dictionary (unregistered words; the new words referred to here are unregistered words) significantly degrade segmentation performance. New word discovery is therefore important for improving segmentation and the work that follows it. In addition, Web 2.0 applications that have appeared in recent years, such as personal blogs, personalized signatures, and microblogs, allow users to generate web content themselves. As a result, new terms such as "refreshing horse" and "Super Girl" appear in large numbers and are replaced very quickly, which poses further challenges for new word discovery.

Research on new word discovery currently concentrates on automatically extracting personal names, place names, translated abbreviations, or terms from a few particular fields (for example, military affairs or finance and economics). By the scope of the task, it falls into three classes: 1. one-for-one, which addresses new word discovery for one specific problem, such as personal names or place names; 2. one-for-several, which addresses new word discovery in several specific classes or fields; 3. one-for-all, which addresses all problems, that is, new word discovery in the open domain. By the method adopted, there are again three main classes: 1. methods based on an n-gram language model; 2. methods that perform new word discovery after segmentation by some word segmentation tool; 3. methods that combine new word discovery with the segmentation tool and detect new words during segmentation. There are also methods that analyze user behavior and apply collaborative filtering. Beyond these groupings, new word discovery can be divided into methods based on linguistic rules and methods based on statistical machine learning, which is currently the mainstream classification.

With the rapid development of communication media such as the Internet, new words are produced extremely quickly and in many varieties, which makes them hard to match with pattern rules. Even when rule templates have been built, the rapid turnover of new words and the variety of their formation rules soon render the templates obsolete. Moreover, rules must be proposed and maintained by linguists, which costs time and money and does not scale. Current work on new word discovery therefore concentrates on one or a few specific domains, using statistical machine learning and introducing domain knowledge (for example, "fighter plane" and "tank", which are characteristic of the military domain, or "stock" and "listing", characteristic of finance and economics). With domain knowledge and labeled new words from the domain, machine learning methods can better learn a model suited to new word discovery in that domain, and even rule-based methods can propose more rules specific to word formation in the domain, improving the precision and recall of identification. Open-domain new word discovery, by contrast, must identify new words from all possible domains, including words such as "refreshing horse" that belong to no specific domain, and has so far received comparatively little study. Its greatest difficulty is that neither rule matching nor statistical machine learning has domain knowledge available to optimize new word discovery. Because the division into domains and their possible number are uncertain, it is very hard to decide which domain a candidate new word belongs to, even given a fairly comprehensive scheme for dividing domains. All of this makes open-domain new word identification harder.

The conditional random field (CRF) model is a probabilistic graphical model. Its principle is to decompose the conditional probability distribution p(y|x) of the output (hidden variable) y given the input sequence (observed variable) x, rather than the joint distribution p(x, y), which avoids modeling and computing the input distribution p(x). Because the inputs x usually have fairly complex relationships with one another, computing p(x) often requires independence assumptions about the input. By decomposing p(y|x) directly, the CRF avoids computing p(x) and can therefore better learn the relationships present in the data set.

In a linear-chain CRF, p(y|x) is decomposed as follows:

p(y \mid x; \omega) = \frac{1}{Z(x, \omega)} \exp \sum_{i=1}^{N} \sum_{j} \omega_{j} f_{j}(y_{i-1}, y_{i}, x, i), \qquad Z(x, \omega) = \sum_{y_{1:N}} \exp \sum_{i=1}^{N} \sum_{j} \omega_{j} f_{j}(y_{i-1}, y_{i}, x, i),

Here f_j is a feature function and i indexes the positions of the input sequence x. Summing each f_j over i guarantees that a variable-length input always yields a fixed number of feature values, one per feature function j. ω is the parameter vector to be estimated, with components ω_j; Z(x, ω) is the normalization factor; y_i is the output of the CRF model; and N is the total number of samples. In theory a feature function f_j may depend on all of x, but in practice, given the complexity of the relationships among inputs and the characteristics of the problem, usually only the current input and the one or two inputs before and after it are used as arguments of the feature function. The advantage of the CRF is that p(y|x, ω) can be computed without assuming any independence among the inputs x. The relationship between input and output is expressed by the feature functions f_j, which the user specifies for the task, and the parameters ω_j, which the CRF learns automatically. The linear-chain CRF adds one restriction: besides its functional dependence on x, the current output y_i may depend only on the previous output y_{i-1}. In the new word discovery task, the model must predict whether the current word can form a new word with its neighboring words (the output y_i). The result depends not only on the feature values of these words (the input x) but also on the prediction for the previous word (y_{i-1}), because whether the previous word was predicted to be part of a new word affects the prediction for the current word. This matches the linear-chain CRF model exactly.
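As an illustration of the formula above, the following minimal sketch evaluates p(y|x; ω) for a tiny linear-chain CRF by enumerating every label sequence to obtain Z(x, ω). The two feature functions, their weights, and the handling of the first position (no previous label, passed as None) are hypothetical choices made only for this example; practical CRF implementations compute Z by dynamic programming rather than by enumeration.

```python
import itertools
import math

LABELS = (0, 1)  # boundary labels "0" / "1" as used later in this document


def score(y, x, weights, features):
    """Unnormalised log-score: sum_i sum_j w_j * f_j(y_{i-1}, y_i, x, i)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else None  # assumption: no previous label at i = 0
        for w_j, f_j in zip(weights, features):
            total += w_j * f_j(y_prev, y[i], x, i)
    return total


def conditional_probability(y, x, weights, features):
    """p(y | x; w), with Z(x, w) obtained by brute-force enumeration."""
    z = sum(
        math.exp(score(candidate, x, weights, features))
        for candidate in itertools.product(LABELS, repeat=len(x))
    )
    return math.exp(score(y, x, weights, features)) / z


# Hypothetical feature functions, for illustration only.
def f_short_word(y_prev, y_cur, x, i):
    return 1.0 if len(x[i]) == 1 and y_cur == 0 else 0.0


def f_same_label(y_prev, y_cur, x, i):
    return 1.0 if y_prev is not None and y_prev == y_cur else 0.0


FEATURES = [f_short_word, f_same_label]
WEIGHTS = [1.5, 0.5]  # hypothetical values of w_j

x = ["a", "bc", "d"]
print(conditional_probability((0, 1, 0), x, WEIGHTS, FEATURES))
```

Because Z(x, ω) sums over every label sequence, the probabilities of all sequences for a given x add up to 1, which is a quick sanity check for any implementation.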

Summary of the invention

The present invention aims to solve at least one of the technical deficiencies described above.

To this end, one object of the present invention is to propose a method for open-domain new word discovery.

Another object of the present invention is to propose a system for open-domain new word discovery.

To achieve the above objects, an embodiment of one aspect of the present invention proposes a method for open-domain new word discovery, comprising the following steps: receiving a corpus to be processed, and performing format conversion and word segmentation on the corpus to obtain a plurality of text messages; extracting characteristic information of the plurality of text messages; judging whether a combination of adjacent text messages within a part of the plurality of text messages is a new word; if so, labeling the new word boundary of the adjacent text messages; estimating the parameters of a conditional random field model from the labeled text messages and the characteristic information; and identifying the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text messages.

According to the method of the embodiment of the present invention, new word boundaries are labeled on the text messages, the parameters of the conditional random field model are estimated, and the text messages are identified to obtain the new words they contain. New words in various fields can thus be identified, and identification efficiency is improved.

In one embodiment of the invention, the characteristic information comprises a left information entropy and a right information entropy.

In one embodiment of the invention, the characteristic information further comprises additional characteristic information obtained by processing the left information entropy and the right information entropy.

In one embodiment of the invention, the left information entropy is expressed by the following formula,

LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},

where LE(w) is the left information entropy, w denotes the current word, A is the set of words that occur to the left of w in the corpus, C(a, w) is the number of times the words a and w occur together in the corpus, and n is a positive integer.

In one embodiment of the invention, the right information entropy is expressed by the following formula,

RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},

where RE(w) is the right information entropy, w denotes the current word, B is the set of words that occur to the right of w in the corpus, C(w, a) is the number of times the words w and a occur together in the corpus, and n is a positive integer.
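The two entropy formulas can be sketched in a few lines of Python. This is only an illustration under two assumptions that the text does not state: "occurring together" is taken to mean immediate adjacency in the segmented text, and n is taken to be the total number of neighbor occurrences, that is, the sum of the counts C. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def neighbour_counts(sentences):
    """Count the immediate left and right neighbours of every word.

    left[w][a]  plays the role of C(a, w): a occurs directly before w.
    right[w][a] plays the role of C(w, a): a occurs directly after w.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for words in sentences:
        for a, b in zip(words, words[1:]):
            left[b][a] += 1
            right[a][b] += 1
    return left, right


def entropy(counts):
    """-(1/n) * sum_a C * log(C / n), written as (1/n) * sum_a C * log(n / C).

    Assumption: n is the sum of all counts in `counts`.
    """
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum(c * math.log(n / c) for c in counts.values()) / n


sentences = [["x", "w", "y"], ["z", "w", "y"]]  # toy segmented corpus
left, right = neighbour_counts(sentences)
le_w = entropy(left["w"])   # left information entropy LE(w)
re_w = entropy(right["w"])  # right information entropy RE(w)
```

In the toy corpus the word w has two different left neighbors and a single right neighbor, so its left entropy is positive and its right entropy is zero. A word with varied neighbors on one side is more likely to end (or start) at that boundary.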

To achieve the above objects, an embodiment of another aspect of the present invention proposes a system for open-domain new word discovery, comprising: a word segmentation module, configured to receive a corpus to be processed and to perform format conversion and word segmentation on the corpus to obtain a plurality of text messages; an extraction module, configured to extract characteristic information of the plurality of text messages; a judging module, configured to judge whether a combination of adjacent text messages within a part of the plurality of text messages is a new word; a labeling module, configured to label the new word boundary of the adjacent text messages; an estimation module, configured to estimate the parameters of a conditional random field model from the labeled text messages and the characteristic information; and an identification module, configured to identify the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text messages.

According to the system of the embodiment of the present invention, new word boundaries are labeled on the text messages, the parameters of the conditional random field model are estimated, and the text messages are identified to obtain the new words they contain. New words in various fields can thus be identified, and identification efficiency is improved.

Additional aspects and advantages of the present invention are given in part in the following description; in part they will become apparent from the description or be learned through practice of the invention.

Description of drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of a method for open-domain new word discovery according to an embodiment of the present invention; and

Fig. 2 is a structural block diagram of a system for open-domain new word discovery according to an embodiment of the present invention.

Embodiment

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.

In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or the number of the technical features referred to. A feature qualified by "first" or "second" may thus explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless specifically defined otherwise.

Fig. 1 is a flowchart of a method for open-domain new word discovery according to an embodiment of the present invention. As shown in Fig. 1, the method according to the embodiment comprises the following steps:

Step S101: receive a corpus to be processed, and perform format conversion and word segmentation on the corpus to obtain a plurality of text messages.

In one embodiment of the invention, the corpus to be processed is segmented with a tool such as the "ICTCLAS3.0 Chinese automatic word segmentation system" (http://ictclas.org/) or the "JVnSegmenter Vietnamese word segmentation system" (http://jvnsegmenter.sourceforge.net/).

Step S102: extract the characteristic information of the plurality of text messages. The characteristic information comprises a left information entropy and a right information entropy.

In one embodiment of the invention, the left information entropy is expressed by the following formula,

LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},

In one embodiment of the invention, the right information entropy is expressed by the following formula,

RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},

In one embodiment of the invention, the characteristic information further comprises additional characteristic information obtained by processing the left information entropy and the right information entropy.

Length of the current word, L_0: the number of characters in each word after segmentation. L_0 measures how many characters the word contains. In a Chinese corpus even a new word is not very long, so L_0 indicates how likely the word is to be part of a new word.

Part of speech of the current word, POS_0: the part-of-speech tag assigned by the segmentation tool after segmentation. In rule-based new word discovery, linguists typically summarize and maintain common part-of-speech patterns of new words, for example rules such as "n+v" and "v+v+n", and use them to find unregistered words after segmentation. The advantage of this approach is that the rules summarized by linguists give high precision in new word discovery.

Corpus-wide term frequency of the current word, TF_0: the number of times the word occurs in the whole corpus. Because this count spans a very wide range, the feature value is discretized into 10 levels.

Inverse document frequency of the current word, IDF_0: defined by a formula [not reproduced in this text] in which D is the total number of documents and D_w is the number of documents that contain the word w.

Co-occurrence frequency of the current word and the previous word, IFA_0: the number of times the current word and the previous word occur together in the corpus.

In-document term frequency of the current word, TFD_0: the number of times the word occurs in the current document.

In-document left information entropy of the current word, LED_0: the left information entropy of the word computed over the current document only.

In-document right information entropy of the current word, RED_0: the right information entropy of the word computed over the current document only.

Average left information entropy of the current word, LEDM_0: defined by a formula [not reproduced in this text] in which D_w is the set of documents that contain the word w and LED_w is the left information entropy of w computed within a document that contains it.

Average right information entropy of the current word, REDM_0: defined by a formula [not reproduced in this text] in which D_w is the set of documents that contain the word w and RED_w is the right information entropy of w computed within a document that contains it.

Mutual information, M_0: defined by a formula [not reproduced in this text] in which p(w) is the probability that the word w occurs.
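The formulas for several of these features are not reproduced in this text, so the following sketch only shows one plausible reading of two of them. It assumes the common form log(D / D_w) for the inverse document frequency and a plain arithmetic mean over the documents containing w for the averaged entropies; both are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def document_frequency(documents):
    """D_w for every word: the number of documents that contain it."""
    df = Counter()
    for words in documents:
        df.update(set(words))  # count each word once per document
    return df


def idf(word, documents, df):
    """Assumed form log(D / D_w); `word` must occur in at least one document."""
    return math.log(len(documents) / df[word])


def mean_over_documents(word, documents, per_document_value):
    """Average a per-document statistic over the documents containing `word`.

    `per_document_value(word, words)` could compute, for example, the
    in-document left entropy LED or the in-document term frequency TFD.
    """
    values = [per_document_value(word, words) for words in documents if word in words]
    return sum(values) / len(values)


documents = [["a", "b"], ["a", "c"], ["a", "b", "b"]]  # toy segmented documents
df = document_frequency(documents)
idf_b = idf("b", documents, df)
mean_tfd_b = mean_over_documents("b", documents, lambda w, ws: ws.count(w))
```

Passing the per-document statistic as a function keeps the averaging step independent of which feature is being averaged.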

In one embodiment of the invention, features can be combined during CRF learning, in addition to being specified individually. The present invention also uses the following combined features as input to CRF learning.

(1) Lengths of the previous word and the current word, and of the current word and the next word: L_{-1} with L_0, and L_0 with L_1

(2) Parts of speech of the previous word and the current word, and of the current word and the next word: POS_{-1} with POS_0, and POS_0 with POS_1

(3) Right information entropy of the current word and left information entropy of the next word: RE_0 with LE_1

(4) Corpus-wide term frequencies of the previous word and the current word, and of the current word and the next word: TF_{-1} with TF_0, and TF_0 with TF_1

(5) In-document term frequencies of the previous word and the current word, and of the current word and the next word: TFD_{-1} with TFD_0, and TFD_0 with TFD_1

(6) In-document right information entropy of the current word and in-document left information entropy of the next word: RED_0 with LED_1

(7) Average right information entropy of the current word and average right information entropy of the next word: REDM_0 with REDM_1
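The combined features pair a value at the current position with the same or a related value at a neighboring position. The sketch below builds a few of these pairs for one position. The dictionary layout, the key names, and the use of None for a missing neighbor are hypothetical choices for illustration, not the format of any particular CRF toolkit.

```python
def combined_features(rows, i):
    """Build combined features (1), (2), (3), (4) and (5) for position i.

    `rows` is a list with one dict per segmented word, keyed by the feature
    names used above (L, POS, TF, TFD, RE, LE). A missing neighbour at the
    start or end of the sequence is encoded as None.
    """
    prev_row = rows[i - 1] if i > 0 else None
    cur = rows[i]
    next_row = rows[i + 1] if i + 1 < len(rows) else None

    def get(row, name):
        return None if row is None else row[name]

    features = {}
    for name in ("L", "POS", "TF", "TFD"):
        features[f"{name}-1|{name}0"] = (get(prev_row, name), cur[name])
        features[f"{name}0|{name}1"] = (cur[name], get(next_row, name))
    # Right entropy of the current word paired with left entropy of the next word.
    features["RE0|LE1"] = (cur["RE"], get(next_row, "LE"))
    return features


rows = [
    {"L": 1, "POS": "n", "TF": 3, "TFD": 1, "RE": 2, "LE": 0},
    {"L": 2, "POS": "v", "TF": 5, "TFD": 2, "RE": 1, "LE": 4},
]
first = combined_features(rows, 0)
second = combined_features(rows, 1)
```

The remaining pairs, (6) and (7), can be added in the same way as the RE_0 with LE_1 entry.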

In one embodiment of the invention, the CRF model used requires integer-valued input features during both learning and prediction. This step therefore applies equal-frequency discretization to all of the features listed above, except the non-numeric feature POS_0 (part of speech of the current word), and maps each of them to 10 levels.
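Equal-frequency discretization assigns levels by rank, so that each level receives roughly the same number of values. A minimal sketch follows; the tie-handling rule (equal values share the level of their first rank) is an assumption, since the text does not say how ties are treated.

```python
def equal_frequency_bins(values, levels=10):
    """Map each value to an integer level in [0, levels) by its rank.

    The sorted values are cut into `levels` groups of (nearly) equal size.
    Assumption: equal values share the level of their first occurrence in
    the sorted order, so identical inputs never land in different levels.
    """
    ranked = sorted(values)
    n = len(ranked)
    first_rank = {}
    for rank, value in enumerate(ranked):
        first_rank.setdefault(value, rank)
    return [first_rank[value] * levels // n for value in values]


term_frequencies = [19 - i for i in range(20)]  # toy feature values 19, 18, ..., 0
levels = equal_frequency_bins(term_frequencies, levels=10)
```

With 20 distinct values and 10 levels, each level receives exactly two values. Heavy ties can leave some levels empty, which is a consequence of the tie rule chosen here.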

Step S103: judge whether a combination of adjacent text messages within a part of the plurality of text messages is a new word.

Step S104: if so, label the new word boundary of the adjacent text messages.

The plurality of text messages obtained after segmentation form the text set S. From S, a text set U of a certain size K is drawn at random, taking paragraphs or sentences as the unit. The word boundaries in U are then labeled manually as new word boundaries or not: if adjacent segmented words A and B can be combined into a new word AB, the boundary between A and B is labeled "0"; otherwise it is labeled "1". In this way new words that the segmentation tool failed to recognize are marked. Labeling stops when the whole of U has been labeled.
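The labeling scheme can be sketched as a function that turns a list of segmented words and a set of known new words into one label per boundary. In the text the decision is made by a human annotator; representing the annotator's decisions as a set named `new_words` is an assumption made only for this illustration.

```python
def label_boundaries(tokens, new_words):
    """Return one label for each boundary between tokens[i] and tokens[i + 1].

    "0": the two tokens belong to the same new word (they should be merged).
    "1": the boundary produced by the segmentation tool is kept.
    """
    labels = ["1"] * (len(tokens) - 1)
    for start in range(len(tokens)):
        merged = tokens[start]
        for end in range(start + 1, len(tokens)):
            merged += tokens[end]
            if merged in new_words:
                # Every boundary inside tokens[start..end] lies within a new word.
                for boundary in range(start, end):
                    labels[boundary] = "0"
    return labels


labels_ab = label_boundaries(["A", "B", "C"], {"AB"})
labels_bcd = label_boundaries(["A", "B", "C", "D"], {"BCD"})
```

For the segmented words A, B, C with the new word AB, the boundary between A and B receives "0" and the boundary between B and C receives "1".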

Step S105: estimate the parameters of the conditional random field model from the labeled text messages and the characteristic information.

Specifically, the parameters of the CRF model are estimated from the characteristic information of the segmented words and the generated new word sample set U, as follows.

The words of the sample set U, the features corresponding to those words, and the labels indicating whether each word boundary is a new word boundary are taken as the input of the CRF. Parameter estimation is then performed to obtain the parameters of the CRF model M.

Step S106: identify the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text messages.

Specifically, the segmented word set S and the CRF model M with its estimated parameters are taken as the input of the CRF, which predicts whether each segmented word boundary is a true new word boundary. The CRF finally outputs a set I', which contains the segmented words and a predicted "0" or "1" for each word boundary, indicating respectively that the segmented word boundary "is not" or "is" a new word boundary. This completes new word discovery on the remaining text messages.
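Once the boundaries have been predicted, the new words can be read off by joining the segmented words across every boundary labeled "0". The sketch below shows this post-processing step only. It assumes the predictions are already available as a list of "0"/"1" strings, one per boundary, and the function name is hypothetical.

```python
def merge_by_boundaries(tokens, labels):
    """Join tokens across every boundary labelled "0".

    Returns (words, new_words): the re-segmented word list and the subset
    of words that were formed by merging two or more tokens.
    """
    if not tokens:
        return [], []
    words, new_words = [], []
    current, parts = tokens[0], 1
    for token, label in zip(tokens[1:], labels):
        if label == "0":
            current += token  # boundary lies inside a new word: merge
            parts += 1
        else:
            words.append(current)
            if parts > 1:
                new_words.append(current)
            current, parts = token, 1
    words.append(current)
    if parts > 1:
        new_words.append(current)
    return words, new_words


words, found = merge_by_boundaries(["A", "B", "C"], ["1", "0"])
```

Here the predicted "0" between B and C merges them into the single new word BC, while A is left as the segmentation tool produced it.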

The method is illustrated below by an example. The following description is for illustration only, and embodiments of the invention are not limited to it.

First, 20000 documents are drawn at random from SogouT2006 and segmented with the ICTCLAS segmenter, and the feature values of the segmented text are extracted.

Then the value of each feature is computed for the segmented text of the 20000 documents, and all numeric features are discretized into 10 levels by equal-frequency discretization.

Next, words in SogouW that the segmentation tool segments incorrectly are taken as new words, and the word boundaries in the segmented web page text from the first step are labeled "0" or "1". If the segmentation tool has split a new word into several words or characters, the boundaries between them are labeled "0"; boundaries that the tool segmented correctly are labeled "1". The labeled words and their labels, together with the corresponding discretized features, are input to the CRF model for parameter estimation, yielding the parameters of the CRF model. The text in which new words are to be found is then segmented and its features discretized, and the new word boundaries are estimated according to the parameters of the CRF model, giving an estimate expressed as "0" or "1".

To verify the validity and reliability of the present invention, the following verification was carried out.

For the experimental text data sets, corpora were built for the open domain and for a specific domain (news). The open-domain experiment used the SogouT2006 data set, a collection of Internet web documents crawled by Sogou Labs in 2006, totaling 40 million Chinese web pages, about 130 GB compressed. 20000 documents were drawn at random from it as the corpus for the open-domain experiment. The specific-domain experiment used the SogouCA data set, news data from a number of news sites released by Sogou Labs, totaling about 1,000,000 news pages, about 450 MB compressed. 20000 documents were drawn at random from it as the corpus.

SogouT2006 and SogouCA have the same format, shown in Table 1.

Table 1

For new word labeling, the words in SogouW, the Internet lexicon released by Sogou Labs, that the segmentation tool failed to recognize during preprocessing were used as the new words of this verification experiment, and a program automatically labeled the new words in the text data sets. SogouW is derived from a statistical analysis of the Chinese Internet corpus indexed by the Sogou search engine. Its statistics date from October 2006, consistent with the time at which the text data sets of this experiment were built, and it contains 150,000 high-frequency words.

For open-domain new word discovery, the algorithm described here achieved a precision of 93.4%, a recall of 94.2%, and an F value of 93.8%. In the specific domain (news), precision was 94.9%, recall 95.7%, and the F value 95.3%. This shows that the present invention is effective.
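The reported F values can be checked against the reported precision and recall. The short sketch below assumes that the F value is the usual harmonic mean of precision and recall (F1), which the text does not state explicitly.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 form is assumed)."""
    return 2 * precision * recall / (precision + recall)


open_domain_f = round(f_measure(0.934, 0.942), 3)  # open-domain figures
news_domain_f = round(f_measure(0.949, 0.957), 3)  # news-domain figures
print(open_domain_f, news_domain_f)  # prints "0.938 0.953"
```

Under that assumption both computed values agree with the reported 93.8% and 95.3%.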

Fig. 2 is a structural block diagram of a system for open-domain new word discovery according to an embodiment of the present invention. As shown in Fig. 2, the system according to the embodiment comprises a word segmentation module 100, an extraction module 200, a judging module 300, a labeling module 400, an estimation module 500, and an identification module 600.

The word segmentation module 100 is configured to receive a corpus to be processed and to perform format conversion and word segmentation on the corpus to obtain a plurality of text messages.

The extraction module 200 is configured to extract the characteristic information of the plurality of text messages. The characteristic information comprises a left information entropy and a right information entropy. The characteristic information further comprises additional characteristic information obtained by processing the left information entropy and the right information entropy.

In one embodiment of the invention, the right information entropy is expressed by the following formula,

RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},

where RE(w) is the right information entropy, w denotes the current word, B is the set of words that occur to the right of w in the corpus, C(w, a) is the number of times the words w and a occur together in the corpus, and n is a positive integer.

In one embodiment of the invention, the part of speech of the current word, POS_0, is the part-of-speech tag that the extraction module 200 obtains from the segmentation tool after segmentation. In rule-based new word discovery, linguists typically summarize and maintain common part-of-speech patterns of new words, for example rules such as "n+v" and "v+v+n", and use them to find unregistered words after segmentation. The advantage of this approach is that the rules summarized by linguists give high precision in new word discovery.

In one embodiment of the invention, the corpus-wide term frequency of the current word, TF_0, is the number of times the word occurs in the whole corpus, as computed by the extraction module 200. Because this count spans a very wide range, the feature value is discretized into 10 levels.

Inverse document frequency of the current word, IDF_0: defined by a formula [not reproduced in this text] in which D is the total number of documents and D_w is the number of documents that contain the word w.

Average left information entropy of the current word, LEDM_0: defined by a formula [not reproduced in this text] in which D_w is the set of documents that contain the word w and LED_w is the left information entropy of w computed within a document that contains it.

Average right information entropy of the current word, REDM_0: defined by a formula [not reproduced in this text].

Mutual information, M_0: defined by a formula [not reproduced in this text] in which p(w) is the probability that the word w occurs.

During CRF learning, features can be combined in addition to being specified individually. The present invention also uses the combined features enumerated for the method as input to CRF learning.

In one embodiment of the invention, the CRF model used requires integer-valued input features during both learning and prediction. Equal-frequency discretization is therefore applied to all of the listed features, except the non-numeric feature POS_0 (part of speech of the current word), mapping each of them to 10 levels.

The judging module 300 is configured to judge whether a combination of adjacent text messages within a part of the plurality of text messages is a new word.

The labeling module 400 is configured to label the new word boundary of the adjacent text messages.

The plurality of text messages obtained after segmentation form the text set S. From S, a text set U of a certain size K is drawn at random, taking paragraphs or sentences as the unit. In the labeling module 400, the word boundaries in U are labeled manually as new word boundaries or not: if adjacent segmented words A and B can be combined into a new word AB, the boundary between A and B is labeled "0"; otherwise it is labeled "1". In this way new words that the segmentation tool failed to recognize are marked. Labeling stops when the whole of U has been labeled.

The estimation module 500 is configured to estimate the parameters of the conditional random field model from the labeled text messages and the characteristic information.

Specifically, the estimation module 500 estimates the parameters of the CRF model from the characteristic information of the segmented words and the generated new word sample set U.

The identification module 600 is configured to identify the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text messages.

The identification module 600 takes the segmented word set S and the CRF model M with its estimated parameters as the input of the CRF, which predicts whether each segmented word boundary is a true new word boundary. The CRF finally outputs a set I', which contains the segmented words and a predicted "0" or "1" for each word boundary, indicating respectively that the segmented word boundary "is not" or "is" a new word boundary. This completes new word discovery on the remaining text messages.

Although embodiments of the present invention have been shown and described above, it is to be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention without departing from its principle and purpose.

Claims

1. A method for open-field new word discovery, characterized by comprising the following steps:

Receiving a corpus to be processed, and performing format conversion and word segmentation on the corpus to obtain a plurality of text messages;

Extracting characteristic information of the plurality of text messages;

Judging whether a combination of adjacent text messages in a part of the plurality of text messages is a new word;

If so, performing new word boundary labeling on the adjacent text messages;

Estimating parameters of a conditional random field model according to the labeled plurality of text messages and the characteristic information;

Identifying the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text messages.

2. The method for open-field new word discovery as claimed in claim 1, characterized in that the characteristic information comprises a left information entropy and a right information entropy.

3. The method for open-field new word discovery as claimed in claim 1, characterized in that the characteristic information further comprises additional characteristic information obtained by processing the left information entropy and the right information entropy.

4. The method for open-field new word discovery as claimed in claim 2, characterized in that the left information entropy is expressed by the following formula,

LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},

5. the method for open field as claimed in claim 3 new word discovery is characterized in that, described right information entropy represents by following formula,

RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},

6. A system for open-field new word discovery, characterized by comprising:

A word segmentation module, configured to receive a corpus to be processed, and to perform format conversion and word segmentation on the corpus to obtain a plurality of text messages;

An extraction module, configured to extract characteristic information of the plurality of text messages;

A judging module, configured to judge whether a combination of adjacent text messages in a part of the plurality of text messages is a new word;

A labeling module, configured to perform new word boundary labeling on the adjacent text messages;

An estimation module, configured to estimate parameters of a conditional random field model according to the labeled plurality of text messages and the characteristic information; and

An identification module, configured to identify the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text messages.

7. The system for open-field new word discovery as claimed in claim 6, characterized in that the characteristic information comprises a left information entropy and a right information entropy.

8. The system for open-field new word discovery as claimed in claim 6, characterized in that the characteristic information further comprises additional characteristic information obtained by processing the left information entropy and the right information entropy.

9. The system for open-field new word discovery as claimed in claim 7, characterized in that the left information entropy is expressed by the following formula,

LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},

10. The system for open-field new word discovery as claimed in claim 8, characterized in that the right information entropy is expressed by the following formula,

RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},
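The left and right information entropy formulas of the claims can be computed as in the following minimal sketch. The symbol definitions are not given in the text above, so the sketch assumes that C(a, w) is the number of times the segmented word a occurs immediately to the left of w, C(w, a) the number of times a occurs immediately to the right of w, A and B the sets of left and right neighbors of w, and n the total number of such neighbor occurrences. The function names `left_entropy` and `right_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def _entropy(counts: Counter) -> float:
    """-(1/n) * sum over neighbors of C * log(C / n), with n = sum of counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0  # w has no neighbors on this side (assumed convention)
    return -sum(c * math.log(c / n) for c in counts.values()) / n


def left_entropy(sentences: List[List[str]], w: str) -> float:
    """LE(w): entropy of the words that immediately precede w."""
    counts = Counter()
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            if cur == w:
                counts[prev] += 1
    return _entropy(counts)


def right_entropy(sentences: List[List[str]], w: str) -> float:
    """RE(w): entropy of the words that immediately follow w."""
    counts = Counter()
    for sent in sentences:
        for cur, nxt in zip(sent, sent[1:]):
            if cur == w:
                counts[nxt] += 1
    return _entropy(counts)


# Hypothetical segmented corpus: w has two different left neighbors
# and always the same right neighbor.
corpus = [["a", "w", "x"], ["b", "w", "x"]]
le = left_entropy(corpus, "w")
re_ = right_entropy(corpus, "w")
```

Under these assumptions, a word whose neighbors on one side vary widely has a high entropy on that side, whereas a word that is always followed by the same neighbor has a right entropy of zero.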