CN103294664A - Method and system for discovering new words in open fields - Google Patents

Info

Publication number
CN103294664A
Authority
CN
China
Prior art keywords
word
text messages
information entropy
text message
neologisms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102791845A
Other languages
Chinese (zh)
Inventor
陈飞
刘奕群
马少平
张敏
金奕江
张阔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN2013102791845A
Publication of CN103294664A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for discovering new words in open domains. The method comprises: receiving a corpus to be processed and performing format conversion and word segmentation on it to obtain a plurality of text segments; extracting feature information of the text segments; judging, for a part of the text segments, whether combinations of adjacent text segments are new words and, if so, labeling the new-word boundaries of those adjacent segments; estimating the parameters of a conditional random field (CRF) model from the labeled text segments and the feature information; and identifying the remaining text segments with the estimated CRF model to obtain the new words among them. By labeling new-word boundaries, estimating the CRF parameters, and then applying the model to the text segments, the method can identify new words in a variety of domains while improving identification efficiency.

Description

Method and system for open-domain new word discovery
Technical field
The present invention relates to the field of intelligent information processing, and in particular to a method and system for open-domain new word discovery.
Background
Unlike Western languages such as English, Chinese has no fixed separator between words, so word segmentation is usually the first necessary step of a Chinese information processing task. Existing studies show that words not contained in the dictionary of the segmentation tool (unregistered words; the new words discussed here are a kind of unregistered word) noticeably degrade segmentation performance. New word discovery is therefore important for improving segmentation and the tasks that follow it. In addition, Web 2.0 applications that have appeared in recent years, such as personal blogs, personalized signatures, and microblogs, let users generate web content themselves. As a result, new terms such as "神马" (shenma) and "Super Girl" appear in large numbers and are replaced very quickly, which makes new word discovery more challenging.
Research on new word discovery currently concentrates on the automatic extraction of personal names, place names, translated abbreviations, or terms from a few specific domains (for example, military affairs or finance and economics). By the scope of the task, the work falls into three classes: (1) one-for-one, which addresses new word discovery for one specific problem, for example personal names or place names; (2) one-for-several, which addresses new word discovery for several specific classes or domains; and (3) one-for-all, which addresses all problems, that is, new word discovery in open domains. By the method used, there are likewise three main classes: (1) methods based on n-gram language models; (2) methods that first segment the text with some segmentation tool and then perform new word discovery; and (3) methods that integrate new word discovery with the segmentation tool and detect new words during segmentation. Other work analyzes user behavior and applies collaborative filtering. Apart from these classifications, new word discovery methods can also be divided into those based on linguistic rules and those based on statistical machine learning, which is currently the mainstream classification.
Because communication media such as the Internet are developing rapidly, new words are produced at great speed and in many forms, and they are hard to match with pattern rules. Even when rule templates are generated, the templates soon become invalid because new words are replaced quickly and their composition rules vary. Moreover, rules have to be proposed and maintained by linguists, which costs time and money and does not scale. Current work therefore concentrates on one or a few specific domains and uses statistical machine learning with domain knowledge (for example, "fighter plane" and "tank" in the military domain, or "stock" and "listing" in the finance domain) to discover new words in that domain. With domain knowledge and labeled new words from the domain, machine learning can learn a model better suited to the domain, and even rule-based matching can propose more rules specific to the domain's word formation, which improves the precision and recall of recognition. Open-domain new word discovery, by contrast, targets every possible domain, and even new words such as "神马" (shenma) that belong to no specific domain, so comparatively little work addresses it. Its greatest difficulty is that neither rule matching nor statistical machine learning has domain knowledge available for optimization. The division into domains and the number of possible domains are uncertain, and even with a fairly comprehensive division rule it is very hard to decide which domain a candidate new word belongs to. All of this increases the difficulty of open-domain new word recognition.
The conditional random field (CRF) model is a probabilistic graphical model. It factorizes the conditional probability distribution p(y|x) of the output (hidden variable) y given the input sequence (observed variable) x, rather than the joint distribution p(x, y), and thus avoids modeling and computing the input distribution p(x). Because the inputs x usually have fairly complex dependencies, computing p(x) typically requires some independence assumption on the inputs. A CRF avoids computing p(x) by factorizing p(y|x) directly, so it can better learn the relations in the data set.
A linear-chain CRF factorizes p(y|x) as follows:
$$p(y \mid x; \omega) = \frac{1}{Z(x, \omega)} \exp \sum_{i=1}^{N} \sum_{j} \omega_j f_j(y_{i-1}, y_i, x, i), \qquad Z(x, \omega) = \sum_{y_{1:N}} \exp \sum_{i=1}^{N} \sum_{j} \omega_j f_j(y_{i-1}, y_i, x, i),$$
where f_j is a feature function and i indexes the positions of the input sequence x. Summing f_j over i guarantees that, for variable-length inputs, the number of feature function values stays fixed at the number of feature functions j. ω_j is the parameter to be estimated for f_j, Z(x, ω) is the normalization factor, y_i is the output of the CRF model, and N is the total number of samples. In theory a feature function f_j may depend on all of x, but in practice, given the complexity and the nature of the relations between inputs, usually only the current input and the one or two inputs before and after it are chosen as arguments of the feature function. The advantage of a CRF is that p(y|x; ω) can be computed without assuming any independence among the inputs x. The relation between input and output is expressed through the feature functions f_j, which the user specifies for the task, and the parameters ω_j, which the CRF learns automatically. A linear-chain CRF imposes an additional restriction: apart from its functional dependence on x, the current output y_i may depend only on the previous output y_{i-1}. In new word discovery, the task is to predict whether the current word can form a new word with its neighboring words (the output y_i). The result depends not only on the feature values of these words (the input x) but also on the prediction for the previous word (y_{i-1}), because whether the previous word is predicted to be part of a new word affects the prediction for the current word. This matches the linear-chain CRF model exactly.
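The factorization above can be checked on a toy example. The following is a minimal sketch, not the patent's implementation: it evaluates p(y|x; ω) by brute-force enumeration of all label sequences, which is only feasible for very short inputs. The two feature functions, the weights, and the `START` symbol standing in for y_0 are illustrative assumptions.

```python
import itertools
import math

def f_same(prev, cur, x, i):
    # Hypothetical feature: fires when the label repeats.
    return 1.0 if prev == cur else 0.0

def f_long(prev, cur, x, i):
    # Hypothetical feature: fires for label "1" on multi-character words.
    return 1.0 if cur == "1" and len(x[i]) > 1 else 0.0

def crf_prob(y, x, labels, feature_funcs, weights, start="START"):
    """p(y | x; w) of a linear-chain CRF, normalised by brute force."""
    def score(seq):
        total, prev = 0.0, start  # `start` plays the role of y_0 (assumption)
        for i, cur in enumerate(seq):
            total += sum(w * f(prev, cur, x, i)
                         for w, f in zip(weights, feature_funcs))
            prev = cur
        return total

    # Z(x, w): sum of exp(score) over every possible label sequence.
    z = sum(math.exp(score(seq))
            for seq in itertools.product(labels, repeat=len(x)))
    return math.exp(score(tuple(y))) / z
```

Because Z(x, ω) sums over all label sequences, the probabilities of all sequences for a fixed x add up to 1. Practical CRF toolkits compute Z with dynamic programming instead of enumeration.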
Summary of the invention
The present invention aims to solve at least one of the technical deficiencies described above.
To this end, one object of the present invention is to propose a method for open-domain new word discovery.
Another object of the present invention is to propose a system for open-domain new word discovery.
To achieve these objects, an embodiment of one aspect of the present invention proposes a method for open-domain new word discovery, comprising the following steps: receiving a corpus to be processed, and performing format conversion and word segmentation on the corpus to obtain a plurality of text segments; extracting feature information of the plurality of text segments; judging whether combinations of adjacent text segments in a part of the plurality of text segments are new words; if so, labeling the new-word boundaries of the adjacent text segments; estimating the parameters of a conditional random field model from the labeled text segments and the feature information; and identifying the remaining text segments among the plurality of text segments according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text segments.
The method according to the embodiment of the invention labels new-word boundaries in the text segments, estimates the parameters of the conditional random field model, and identifies the text segments to obtain the new words among them. It can thus identify new words in a variety of domains while improving identification efficiency.
In one embodiment of the invention, the feature information comprises the left information entropy and the right information entropy.
In one embodiment of the invention, the feature information further comprises additional feature information derived from the left information entropy and the right information entropy.
In one embodiment of the invention, the left information entropy is given by the following formula,
$$LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},$$
where LE(w) is the left information entropy, w is the word, A is the set of words that appear to the left of w in the corpus, C(a, w) is the number of times the words a and w occur together in the corpus, and n is a positive integer.
In one embodiment of the invention, the right information entropy is given by the following formula,
$$RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},$$
where RE(w) is the right information entropy, w is the word, B is the set of words that appear to the right of w in the corpus, C(w, a) is the number of times the words w and a occur together in the corpus, and n is a positive integer.
To achieve the above objects, an embodiment of another aspect of the invention proposes a system for open-domain new word discovery, comprising: a word segmentation module, configured to receive a corpus to be processed and to perform format conversion and word segmentation on the corpus to obtain a plurality of text segments; an extraction module, configured to extract feature information of the plurality of text segments; a judging module, configured to judge whether combinations of adjacent text segments in a part of the plurality of text segments are new words; a labeling module, configured to label the new-word boundaries of the adjacent text segments; an estimation module, configured to estimate the parameters of a conditional random field model from the labeled text segments and the feature information; and an identification module, configured to identify the remaining text segments among the plurality of text segments according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text segments.
The system according to the embodiment of the invention labels new-word boundaries in the text segments, estimates the parameters of the conditional random field model, and identifies the text segments to obtain the new words among them. It can thus identify new words in a variety of domains while improving identification efficiency.
In one embodiment of the invention, the feature information comprises the left information entropy and the right information entropy.
In one embodiment of the invention, the feature information further comprises additional feature information derived from the left information entropy and the right information entropy.
In one embodiment of the invention, the left information entropy is given by the following formula,
$$LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},$$
where LE(w) is the left information entropy, w is the word, A is the set of words that appear to the left of w in the corpus, C(a, w) is the number of times the words a and w occur together in the corpus, and n is a positive integer.
In one embodiment of the invention, the right information entropy is given by the following formula,
$$RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},$$
where RE(w) is the right information entropy, w is the word, B is the set of words that appear to the right of w in the corpus, C(w, a) is the number of times the words w and a occur together in the corpus, and n is a positive integer.
Additional aspects and advantages of the invention are set forth in part in the description that follows; in part they will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for open-domain new word discovery according to an embodiment of the invention; and
Fig. 2 is a structural block diagram of a system for open-domain new word discovery according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting it.
In the description of the invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. A feature qualified by "first" or "second" may therefore expressly or implicitly include one or more such features. In the description of the invention, "a plurality of" means two or more, unless specifically defined otherwise.
Fig. 1 is a flowchart of a method for open-domain new word discovery according to an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps:
Step S101: receive a corpus to be processed, and perform format conversion and word segmentation on it to obtain a plurality of text segments.
In one embodiment of the invention, the corpus is segmented with a tool such as the "ICTCLAS3.0 Chinese automatic word segmentation system" (http://ictclas.org/) or the "JVnSegmenter Vietnamese word segmentation system" (http://jvnsegmenter.sourceforge.net/).
Step S102: extract the feature information of the plurality of text segments. The feature information comprises the left information entropy and the right information entropy.
In one embodiment of the invention, the left information entropy is given by the following formula,
$$LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},$$
where LE(w) is the left information entropy, w is the word, A is the set of words that appear to the left of w in the corpus, C(a, w) is the number of times the words a and w occur together in the corpus, and n is a positive integer.
In one embodiment of the invention, the right information entropy is given by the following formula,
$$RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},$$
where RE(w) is the right information entropy, w is the word, B is the set of words that appear to the right of w in the corpus, C(w, a) is the number of times the words w and a occur together in the corpus, and n is a positive integer.
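The two entropy features can be computed from adjacent-word counts. The sketch below is illustrative only. The text says only that n is a positive integer; the code assumes n is the total number of left (or right) neighbor occurrences of w, which makes LE and RE ordinary Shannon entropies of the neighbor distribution. The natural logarithm and the function name `neighbour_entropies` are also assumptions.

```python
import math
from collections import Counter, defaultdict

def neighbour_entropies(sentences):
    """Return {word: (LE, RE)} for a corpus of segmented sentences."""
    left = defaultdict(Counter)   # left[w][a]  = C(a, w): a directly before w
    right = defaultdict(Counter)  # right[w][a] = C(w, a): a directly after w
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            left[cur][prev] += 1
            right[prev][cur] += 1

    def entropy(counts):
        n = sum(counts.values())  # ASSUMPTION: n = total neighbour count
        if n == 0:
            return 0.0            # word never has a neighbour on this side
        return -sum(c * math.log(c / n) for c in counts.values()) / n

    vocab = {w for words in sentences for w in words}
    return {w: (entropy(left[w]), entropy(right[w])) for w in vocab}
```

A word whose neighbors vary widely gets a high entropy on that side, which suggests a real word boundary there; a word that is almost always followed by the same neighbor gets a right entropy near zero, which suggests the two may belong together.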
In one embodiment of the invention, the feature information further comprises additional feature information derived from the left information entropy and the right information entropy.
Current word length L_0: the number of characters in each word after segmentation. L_0 measures how many characters the word contains. In a Chinese corpus even new words are not very long, so L_0 indicates how likely the word is to be part of a new word.
Current word part of speech POS_0: the part-of-speech tag assigned by the segmentation tool after segmentation. In rule-based new word discovery, linguists often summarize and maintain common part-of-speech structures of new words, for example rules such as "n+v" or "v+v+n", and use them to find unregistered words after segmentation. The advantage of this approach is that rules summarized by linguists achieve high precision in new word discovery.
Current word corpus term frequency TF_0: the number of times the word occurs in the whole corpus. Because this count spans a very wide range, the feature value is discretized into 10 levels.
Current word IDF_0: the inverse document frequency of the word, defined in terms of D and D_w, where D denotes the total number of documents and D_w denotes the number of documents that contain the word w.
Co-occurrence frequency of the current word and the previous word IFA_0: the number of times the current word and the previous word occur together in the corpus.
Current word in-document term frequency TFD_0: the number of times the word occurs in the current document.
Current word in-document left information entropy LED_0: the left information entropy of the word computed with the current document as the statistical object.
Current word in-document right information entropy RED_0: the right information entropy of the word computed with the current document as the statistical object.
Current word average left information entropy LEDM_0: defined in terms of D_w and LED_w, where D_w is the set of documents containing the word w and LED_w denotes the left information entropy of w computed within the document in which it occurs.
Current word average right information entropy REDM_0: it is defined as,
Figure BDA00003462770100062
where D_w is the set of documents containing the word w and RED_w denotes the right information entropy of w computed within the document in which it occurs.
Mutual information M_0: it is defined as,
Figure BDA00003462770100063
where p(w) is the probability that the word w occurs.
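Several of the count-based features above can be gathered in one pass over the segmented corpus. The following is a rough sketch under stated assumptions: the IDF formula is not legible in this text, so the code uses the common form log(D / D_w) purely as a stand-in, and the function name and return layout are hypothetical.

```python
import math
from collections import Counter

def corpus_statistics(docs):
    """docs: list of documents, each a list of segmented words."""
    tf = Counter()    # TF: occurrences of each word in the whole corpus
    df = Counter()    # D_w: number of documents containing each word
    pair = Counter()  # IFA: count of (previous word, current word) pairs
    tfd = []          # TFD: one Counter of term frequencies per document
    for words in docs:
        doc_tf = Counter(words)
        tfd.append(doc_tf)
        tf.update(doc_tf)
        df.update(doc_tf.keys())
        pair.update(zip(words, words[1:]))
    # ASSUMPTION: standard IDF form; the source formula may differ.
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    return tf, idf, pair, tfd
```

The in-document entropies LED and RED, and their averages over D_w, would follow the same pattern by computing the neighbor counts per document instead of over the whole corpus.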
In one embodiment of the invention, during CRF learning, features can be combined in addition to being specified individually. The invention also uses the following combined features as input to CRF learning.
(1) Previous word length and current word length; current word length and next word length: L_{-1} with L_0, and L_0 with L_1
(2) Previous word part of speech and current word part of speech; current word part of speech and next word part of speech: POS_{-1} with POS_0, and POS_0 with POS_1
(3) Current word right information entropy and next word left information entropy: RE_0 with LE_1
(4) Previous word and current word corpus term frequency; current word and next word corpus term frequency: TF_{-1} with TF_0, and TF_0 with TF_1
(5) Previous word and current word in-document term frequency; current word and next word in-document term frequency: TFD_{-1} with TFD_0, and TFD_0 with TFD_1
(6) Current word in-document right information entropy and next word in-document left information entropy: RED_0 with LED_1
(7) Current word average right information entropy and next word average right information entropy: REDM_0 with REDM_1
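The combined features (1) to (7) pair a feature of one word with a feature of its neighbor. One possible way to build them for position i is sketched below. The dictionary layout of `rows`, the `_PAD_` value for positions outside the sentence, and the string encoding of each pair are illustrative assumptions. Item (6) follows its textual description (right entropy of the current word with left entropy of the next word).

```python
# (feature, offset, feature, offset) for combined features (1) to (7).
PAIRS = [
    ("L", -1, "L", 0), ("L", 0, "L", 1),
    ("POS", -1, "POS", 0), ("POS", 0, "POS", 1),
    ("RE", 0, "LE", 1),
    ("TF", -1, "TF", 0), ("TF", 0, "TF", 1),
    ("TFD", -1, "TFD", 0), ("TFD", 0, "TFD", 1),
    ("RED", 0, "LED", 1),
    ("REDM", 0, "REDM", 1),
]

def combined_features(rows, i, pad="_PAD_"):
    """rows[k] maps feature names to the (discretized) values of word k."""
    def get(name, offset):
        j = i + offset
        # Positions before the first or after the last word get a pad value.
        return rows[j][name] if 0 <= j < len(rows) else pad

    return {
        f"{a}[{oa}]|{b}[{ob}]": f"{get(a, oa)}|{get(b, ob)}"
        for a, oa, b, ob in PAIRS
    }
```

Each combined feature is emitted as a single categorical value, so the CRF can learn a separate weight for every observed pair of levels.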
In one embodiment of the invention, the CRF model used requires integer feature values as input for both learning and prediction. This step therefore applies equal-frequency discretization to all of the features listed above, except the non-numeric feature (the current word part of speech POS_0), and maps each to 10 levels.
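Equal-frequency discretization assigns levels by rank so that each level holds roughly the same number of values. A minimal sketch follows; the tie handling (equal values share the lower level) and the 0-based level numbering are assumptions, since the text specifies only "equal-frequency" and "10 levels".

```python
def equal_frequency_levels(values, levels=10):
    """Map each value to an integer level in 0..levels-1 by its rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    prev_value, prev_level = None, 0
    for rank, idx in enumerate(order):
        level = rank * levels // len(values)
        # Equal values must not be split across two levels.
        if rank > 0 and values[idx] == prev_value:
            level = prev_level
        result[idx] = level
        prev_value, prev_level = values[idx], level
    return result
```

Compared with equal-width bins, this keeps heavy-tailed counts such as term frequency from collapsing almost all words into the lowest level.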
Step S103: judge whether combinations of adjacent text segments in a part of the plurality of text segments are new words.
Step S104: if so, label the new-word boundaries of the adjacent text segments.
From the segmented text segments, that is, the text set S, a text set U of size K is randomly drawn with the paragraph or sentence as the unit. The word boundaries produced by segmentation in U are then labeled manually as to whether they are new-word boundaries: if adjacent segmented words A and B can be combined into a new word AB, the boundary between A and B is labeled "0"; otherwise it is labeled "1". In this way the new words that the segmentation tool failed to recognize are marked. Labeling stops when the whole of U has been labeled.
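The "0"/"1" boundary convention can be illustrated with a small program. The text above describes manual labeling; the sketch below instead uses a given set of known new words as a stand-in for the annotator, so the function and its dictionary-lookup strategy are assumptions made for illustration.

```python
def label_boundaries(words, new_words):
    """Return one label per boundary between words[k] and words[k + 1].

    "0": the two neighbours lie inside the same new word (they combine).
    "1": the boundary is a correct word boundary.
    """
    labels = ["1"] * (len(words) - 1)
    for start in range(len(words)):
        text = words[start]
        for end in range(start + 1, len(words)):
            text += words[end]
            if text in new_words:
                # Every boundary inside words[start..end] joins a new word.
                for k in range(start, end):
                    labels[k] = "0"
    return labels
```

For the segmented sentence `["我", "喜欢", "超", "女"]` and the new word "超女" (Super Girl), the boundary between "超" and "女" is labeled "0" and the other two boundaries are labeled "1".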
Step S105: estimate the parameters of the conditional random field model from the labeled text segments and the feature information.
Specifically, the parameters of the CRF model are estimated from the feature information of the segmented words and the labeled new-word sample set U as follows.
The words of the sample set U, the features corresponding to U, and the labels indicating whether each word boundary is a new-word boundary are used as the input of the CRF, and parameter estimation is performed to obtain the parameters of the CRF model M.
Step S106: identify the remaining text segments among the plurality of text segments according to the estimated parameters of the conditional random field model, to obtain the new words in the remaining text segments.
Specifically, the segmented word set S and the CRF model M with its estimated parameters are used as the input of the CRF, which predicts whether each segmented word boundary is a true new-word boundary. The CRF finally outputs a set I', which contains the segmented words and a predicted "0" or "1" for each word boundary, indicating respectively that the boundary "is not" or "is" a new-word boundary. This completes new word discovery for the remaining text segments.
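Once the CRF has predicted a "0" or "1" for each boundary, the new words can be read off by joining the segmented words across "0" boundaries. The sketch below assumes the convention of step S104, where "0" between two adjacent words means they combine into one new word; the function name and the list-based input format are hypothetical.

```python
def extract_new_words(words, boundary_labels):
    """Join words across "0" boundaries and return the merged new words."""
    if not words:
        return []
    found = []
    current = [words[0]]
    for word, label in zip(words[1:], boundary_labels):
        if label == "0":
            current.append(word)       # still inside the same new word
        else:
            if len(current) > 1:
                found.append("".join(current))
            current = [word]           # a true boundary starts a new unit
    if len(current) > 1:
        found.append("".join(current))
    return found
```

Words that are not joined to any neighbor are left as the segmentation tool produced them and are not reported.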
The method is described in detail below by way of example. The following description is for illustrative purposes only, and embodiments of the invention are not limited to it.
20000 documents are randomly drawn from SogouT2006 and segmented with the ICTCLAS segmenter, and the feature values of the segmented text are extracted.
The value of each feature is then computed for the segmented text of the 20000 documents, and all numerical features are discretized into 10 levels by equal-frequency discretization.
Next, the words in SogouW that the segmentation tool fails to segment correctly are taken as new words, and the word boundaries of the segmented web page text from the first step are labeled "0" or "1". If the segmentation tool has split a new word into several words or characters, the boundaries between those words or characters are labeled "0"; if the tool has segmented correctly, the correct word boundaries are labeled "1". The labeled words and their labels, together with the discretized features of those words, are fed to the CRF model for parameter estimation to obtain the model parameters. The text in which new words are to be discovered is then segmented and its features discretized, and the new-word boundaries are estimated with the learned CRF parameters, giving an estimate expressed as "0" or "1" for each boundary.
The method according to the embodiment of the invention labels new-word boundaries in the text segments, estimates the parameters of the conditional random field model, and identifies the text segments to obtain the new words among them. It can thus identify new words in a variety of domains while improving identification efficiency.
To verify the validity and reliability of the invention, the following verification was carried out.
For the experimental text data, corpora were built for the open domain and for a specific domain (news). The open-domain experiment used the SogouT2006 data set, a collection of Internet web documents crawled by Sogou Labs in 2006, totaling 40 million Chinese web pages and about 130 GB after compression. 20000 documents were randomly drawn from it as the corpus for the open-domain verification test. The specific-domain experiment used the SogouCA data set, news data from a number of news sites released by Sogou Labs, totaling about 1 million news pages and about 450 MB after compression. 20000 documents were randomly drawn from it as the corpus.
SogouT2006 and SogouCA have the same format, as shown in Table 1.
Table 1
For new-word labeling, the words in SogouW, the Internet dictionary released by Sogou Labs, that are not recognized by the same segmentation tool are used as the new words of this verification experiment, and a program automatically labels them in the text data sets. SogouW comes from a statistical analysis of the Chinese Internet corpus indexed by the Sogou search engine. Its statistics date from October 2006, which is consistent with the time the text data sets of this test were built, and it contains 150,000 high-frequency words.
For open-domain new word discovery, the algorithm described here achieves a precision of 93.4%, a recall of 94.2%, and an F value of 93.8%. For the specific domain (news), the precision is 94.9%, the recall is 95.7%, and the F value is 95.3%. These results show that the invention performs well.
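The reported F values are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that the first figure of each pair (translated elsewhere as "accuracy") is precision and that F means F1; neither is stated explicitly in the text.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

open_domain = f_measure(93.4, 94.2)   # rounds to 93.8
news_domain = f_measure(94.9, 95.7)   # rounds to 95.3
```

Both values round to the figures reported above, which supports reading the three numbers as precision, recall, and F1.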
Fig. 2 is a structural block diagram of a system for open-domain new word discovery according to an embodiment of the invention. As shown in Fig. 2, the system comprises a word segmentation module 100, an extraction module 200, a judging module 300, a labeling module 400, an estimation module 500, and an identification module 600.
The word segmentation module 100 is configured to receive a corpus to be processed and to perform format conversion and word segmentation on it to obtain a plurality of text segments.
The extraction module 200 is configured to extract the feature information of the plurality of text segments. The feature information comprises the left information entropy and the right information entropy, and further comprises additional feature information derived from the left information entropy and the right information entropy.
In one embodiment of the invention, the left information entropy is given by the following formula,
$$LE(w) = -\frac{1}{n} \sum_{a \in A} C(a, w) \log \frac{C(a, w)}{n},$$
where LE(w) is the left information entropy, w is the word, A is the set of words that appear to the left of w in the corpus, C(a, w) is the number of times the words a and w occur together in the corpus, and n is a positive integer.
In one embodiment of the invention, the right information entropy is given by the following formula,
$$RE(w) = -\frac{1}{n} \sum_{a \in B} C(w, a) \log \frac{C(w, a)}{n},$$
where RE(w) is the right information entropy, w is the word, B is the set of words that appear to the right of w in the corpus, C(w, a) is the number of times the words w and a occur together in the corpus, and n is a positive integer.
In one embodiment of the invention, this word part of speech POS 0: extraction module 200 is undertaken by the participle instrument after the participle, the part-of-speech tagging that obtains.In the new word discovery method of utilizing rule, often summed up by the linguist and safeguard the neologisms part of speech structure that some are common, for example, rule such as " n+v ", " v+v+n " etc., and come participle unregistered word is afterwards found with this.Adopt this method, its advantage is that the rule that the linguist sums up has higher accuracy rate to new word discovery.
In one embodiment of the invention, this word full text word frequency TF 0: extraction module 200 calculates this word occurrence number in whole corpus.Because the occurrence number span of word in corpus is very big, therefore this eigenwert has been done the discretize processing of 10 grades.
The IDF of this word, IDF_0, is defined as
[Formula image BDA00003462770100091 is not reproduced in the text.]
where D denotes the total number of documents and D_w denotes the number of documents that contain the word w.
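Because the IDF formula itself is not reproduced here, the following Python sketch assumes the common textbook form log(D / D_w). That form is an assumption and not necessarily the exact expression used in the patent, and the function name is hypothetical.

```python
import math

def idf(documents, w):
    """IDF of word w over a list of token lists, assuming log(D / D_w)."""
    total = len(documents)                               # D
    containing = sum(1 for doc in documents if w in doc)  # D_w
    if containing == 0:
        raise ValueError("word does not occur in any document")
    return math.log(total / containing)
```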
Co-occurrence frequency of this word and the previous word, IFA_0: the number of times this word and the previous word occur together in the corpus.
In-document word frequency of this word, TFD_0: the number of occurrences of this word in the current document.
In-document left information entropy of this word, LED_0: the left information entropy of this word, computed with the current document as the statistical scope.
In-document right information entropy of this word, RED_0: the right information entropy of this word, computed with the current document as the statistical scope.
The average left information entropy of this word, LEDM_0, is defined as
[Formula image BDA00003462770100092 is not reproduced in the text.]
where D_w is the set of documents containing the word w, and LED_w denotes the left information entropy of w computed within the document in which it appears.
The average right information entropy of this word, REDM_0, is defined as
[Formula image BDA00003462770100093 is not reproduced in the text.]
where D_w is the set of documents containing the word w, and RED_w denotes the right information entropy of w computed within the document in which it appears.
Mutual information, M_0: it is defined by a formula (not reproduced in the text) in which p(w) is the probability that the word w occurs.
In addition to the directly specified features, a CRF can also combine features during learning. The present invention also uses the following combined features as input for CRF learning.
(1) Lengths of the previous word and this word, and of this word and the next word: L_-1 with L_0, and L_0 with L_1.
(2) Parts of speech of the previous word and this word, and of this word and the next word: POS_-1 with POS_0, and POS_0 with POS_1.
(3) Right information entropy of this word and left information entropy of the next word: RE_0 with LE_1.
(4) Corpus-wide word frequencies of the previous word and this word, and of this word and the next word: TF_-1 with TF_0, and TF_0 with TF_1.
(5) In-document word frequencies of the previous word and this word, and of this word and the next word: TFD_-1 with TFD_0, and TFD_0 with TFD_1.
(6) In-document right information entropy of this word and in-document left information entropy of the next word: RED_0 with LED_1.
(7) Average right information entropy of this word and average right information entropy of the next word: REDM_0 with REDM_1.
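As a rough illustration of how the combined features (1) to (7) could be assembled, the following Python sketch pairs the already discretized feature values of neighboring words. The dictionary keys, the "NA" padding value, and the function names are hypothetical. Item (6) follows the prose description (right entropy of this word paired with left entropy of the next word).

```python
PAD = "NA"  # hypothetical placeholder for positions outside the sentence

def feat(rows, i, name):
    """Feature `name` of the word at position i, or PAD outside the sentence."""
    return str(rows[i][name]) if 0 <= i < len(rows) else PAD

def combined_features(rows, i):
    """Pairwise combined features (1)-(7) for the word at position i."""
    out = {}
    # (1), (2), (4), (5): previous/current and current/next pairs.
    for name in ("L", "POS", "TF", "TFD"):
        out[f"{name}-1|{name}0"] = feat(rows, i - 1, name) + "|" + feat(rows, i, name)
        out[f"{name}0|{name}1"] = feat(rows, i, name) + "|" + feat(rows, i + 1, name)
    # (3), (6), (7): an entropy of this word paired with an entropy of the next word.
    for cur, nxt in (("RE", "LE"), ("RED", "LED"), ("REDM", "REDM")):
        out[f"{cur}0|{nxt}1"] = feat(rows, i, cur) + "|" + feat(rows, i + 1, nxt)
    return out
```

Here `rows` is a list with one dictionary of single-word features per segmented word. Many CRF toolkits can generate such pairings from feature templates instead, so explicit construction is only one option.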
In one embodiment of the invention, when the CRF model is used for learning and prediction, its input feature values must be integers. Therefore, this step uses equal-frequency discretization: all the features listed in 3 (except the non-numerical feature, the part of speech POS_0 of this word) are discretized into 10 levels.
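Equal-frequency discretization can be sketched in Python as follows. The function name and the tie-handling rule (equal values share one level) are assumptions made for illustration. The text only specifies that numerical features are mapped to 10 levels.

```python
def equal_frequency_bins(values, levels=10):
    """Map each value to a level in 0..levels-1 with roughly equal-sized bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    prev_value, prev_bin = None, 0
    for rank, idx in enumerate(order):
        if rank > 0 and values[idx] == prev_value:
            bins[idx] = prev_bin  # assumption: equal values share one level
        else:
            prev_bin = rank * levels // len(values)
            bins[idx] = prev_bin
        prev_value = values[idx]
    return bins
```

Unlike equal-width binning, this keeps the levels roughly balanced even when a feature such as word frequency is heavily skewed.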
The judging module 300 is used to judge whether the combination of adjacent text messages in a part of the plurality of text messages is a new word.
The labeling module 400 is used to label the new-word boundaries of the adjacent text messages.
From the plurality of text messages after word segmentation (that is, the text collection S), a text collection U of a certain size K is drawn at random, in units of text segments or sentences. The labeling module 400 manually labels whether each segmented-word boundary in U is a new-word boundary: if the adjacent segmented words A and B can be combined into a new word AB, the boundary between A and B is labeled "0"; otherwise, the boundary between A and B is labeled "1". In this way, the new words that the word segmentation tool failed to recognize are marked. Labeling stops when the whole text collection U has been labeled.
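The labeling convention can be illustrated with a small Python sketch. The text describes manual labeling, so this is only an automated stand-in that assumes a known list of new words is available. The function name, the `new_words` set, and the `max_span` limit are hypothetical.

```python
def label_boundaries(tokens, new_words, max_span=4):
    """Return one label per boundary between tokens[i] and tokens[i + 1].

    "0" means the boundary lies inside a new word (the segmenter split it),
    "1" means the boundary is a correct word boundary.
    """
    labels = ["1"] * (len(tokens) - 1)
    for start in range(len(tokens)):
        # Try every run of 2..max_span adjacent tokens starting at `start`.
        for end in range(start + 2, min(start + max_span, len(tokens)) + 1):
            if "".join(tokens[start:end]) in new_words:
                for b in range(start, end - 1):
                    labels[b] = "0"
    return labels
```

For example, if the segmenter splits "微博" into "微" and "博", the boundary between the two tokens is labeled "0" and all other boundaries keep the label "1".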
The estimation module 500 is used to estimate the parameters of the conditional random field model according to the labeled text messages and the characteristic information.
Specifically, the estimation module 500 estimates the parameters of the CRF model from the characteristic information of the segmented words and the generated new-word sample set U, in the following manner.
The words of the sample set U, the features corresponding to these words in U, and the labels indicating whether each boundary is a new-word boundary are used as the input of the CRF, and parameter estimation is performed to obtain the parameters of the CRF model M.
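The text does not name a CRF toolkit. As one possible way to prepare the input, the Python sketch below writes the labeled sample in the column layout used by CRF++-style tools: one word per line, whitespace-separated feature columns, the label in the last column, and a blank line between sentences. The function name is hypothetical, as is the choice to attach each boundary label to the word on its left and to give the last word of a sentence the label "1".

```python
def to_crf_rows(sentences):
    """sentences: list of (tokens, features, boundary_labels) triples."""
    lines = []
    for tokens, features, labels in sentences:
        # Assumption: the boundary after the last word is always a real one.
        for tok, feats, lab in zip(tokens, features, list(labels) + ["1"]):
            lines.append(" ".join([tok, *map(str, feats), lab]))
        lines.append("")  # a blank line ends a sentence
    return "\n".join(lines)
```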
The identification module 600 is used to identify the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, so as to obtain the new words in the remaining text messages.
Using the parameters of the CRF model M, the identification module 600 takes the segmented word set S and the CRF model M with its determined parameters as the input of the CRF and predicts whether each segmented-word boundary is a real new-word boundary. The CRF finally outputs a set I', which comprises the segmented word set and the "0" or "1" predicted for each word boundary, indicating respectively that the segmented-word boundary "is not" or "is" a new-word word boundary, thereby completing new word discovery for the remaining text messages.
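Once the boundary labels have been predicted, the new words can be read off by merging the segmented words across every boundary predicted as "0". The following Python sketch shows this post-processing step. The function name is hypothetical, and the step is an illustration of the labeling convention rather than a procedure spelled out in the text.

```python
def merge_new_words(tokens, labels):
    """Merge adjacent tokens whose shared boundary is labeled "0".

    Returns (words, new_words): the re-segmented sentence and the merged
    words that were assembled from more than one token.
    """
    if not tokens:
        return [], []
    words, new_words = [], []
    current, parts = tokens[0], 1
    for tok, lab in zip(tokens[1:], labels):
        if lab == "0":
            current += tok  # not a real boundary: keep extending the word
            parts += 1
        else:
            words.append(current)
            if parts > 1:
                new_words.append(current)
            current, parts = tok, 1
    words.append(current)
    if parts > 1:
        new_words.append(current)
    return words, new_words
```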
The method is described in detail below by way of example. It should be understood that the following description is for illustrative purposes only, and embodiments of the invention are not limited thereto.
First, 20000 documents are drawn at random from SogouT2006 and segmented with the ICTCLAS word segmenter, and the feature values of the segmented text messages are extracted.
Then, for the text messages of the 20000 segmented documents, the value of each feature is calculated. All numerical features are then discretized into 10 levels using equal-frequency discretization.
Afterwards, the words in SogouW that the word segmentation tool segments incorrectly are taken as new words, and the word boundaries in the segmented web page text from 1 are labeled "0" or "1". If the word segmentation tool has split a new word into several words or characters, the boundaries between these words or characters are labeled "0"; if the word segmentation tool has segmented correctly, the correct word boundaries are labeled "1". The labeled words and their labels, together with the discretized features corresponding to these words, are input to the CRF model for parameter estimation, so as to obtain the parameters of the CRF model. The text in which new words are to be discovered is then segmented and its features are discretized, and the boundaries of the new words to be discovered are estimated according to the parameters of the CRF model, yielding an estimation result expressed as "0" or "1".
The system according to the embodiment of the invention labels new-word boundaries in the text messages, estimates the parameters of the conditional random field model, and identifies the plurality of text messages to obtain the new words in them. It can thus identify new words in various fields while improving the efficiency of identification.
Although embodiments of the invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention without departing from its principle and purpose.

Claims (10)

1. A method for discovering new words in open fields, characterized by comprising the following steps:
receiving a corpus to be processed, and performing format conversion and word segmentation on the corpus to obtain a plurality of text messages;
extracting characteristic information of the plurality of text messages;
judging whether the combination of adjacent text messages in a part of the plurality of text messages is a new word;
if so, labeling the new-word boundaries of the adjacent text messages;
estimating the parameters of a conditional random field model according to the labeled text messages and the characteristic information; and
identifying the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, so as to obtain the new words in the remaining text messages.
2. The method for discovering new words in open fields according to claim 1, characterized in that the characteristic information comprises the left information entropy and the right information entropy.
3. The method for discovering new words in open fields according to claim 1, characterized in that the characteristic information also comprises additional characteristic information derived from the left information entropy and the right information entropy.
4. The method for discovering new words in open fields according to claim 2, characterized in that the left information entropy is expressed by the following formula,
$$LE(w) = -\frac{1}{n}\sum_{a \in A} C(a, w)\,\log\frac{C(a, w)}{n},$$
where LE(w) is the left information entropy, w denotes this word, A is the set of words located to the left of w in the corpus, C(a, w) denotes the number of times the words a and w occur together in the corpus, and n is a positive integer.
5. The method for discovering new words in open fields according to claim 3, characterized in that the right information entropy is expressed by the following formula,
$$RE(w) = -\frac{1}{n}\sum_{a \in B} C(w, a)\,\log\frac{C(w, a)}{n},$$
where RE(w) is the right information entropy, w denotes this word, B is the set of words located to the right of w in the corpus, C(w, a) denotes the number of times the words w and a occur together in the corpus, and n is a positive integer.
6. A system for discovering new words in open fields, characterized by comprising:
a word segmentation module, used to receive a corpus to be processed and to perform format conversion and word segmentation on the corpus to obtain a plurality of text messages;
an extraction module, used to extract characteristic information of the plurality of text messages;
a judging module, used to judge whether the combination of adjacent text messages in a part of the plurality of text messages is a new word;
a labeling module, used to label the new-word boundaries of the adjacent text messages;
an estimation module, used to estimate the parameters of a conditional random field model according to the labeled text messages and the characteristic information; and
an identification module, used to identify the remaining text messages among the plurality of text messages according to the estimated parameters of the conditional random field model, so as to obtain the new words in the remaining text messages.
7. The system for discovering new words in open fields according to claim 6, characterized in that the characteristic information comprises the left information entropy and the right information entropy.
8. The system for discovering new words in open fields according to claim 6, characterized in that the characteristic information also comprises additional characteristic information derived from the left information entropy and the right information entropy.
9. The system for discovering new words in open fields according to claim 7, characterized in that the left information entropy is expressed by the following formula,
$$LE(w) = -\frac{1}{n}\sum_{a \in A} C(a, w)\,\log\frac{C(a, w)}{n},$$
where LE(w) is the left information entropy, w denotes this word, A is the set of words located to the left of w in the corpus, C(a, w) denotes the number of times the words a and w occur together in the corpus, and n is a positive integer.
10. The system for discovering new words in open fields according to claim 8, characterized in that the right information entropy is expressed by the following formula,
$$RE(w) = -\frac{1}{n}\sum_{a \in B} C(w, a)\,\log\frac{C(w, a)}{n},$$
where RE(w) is the right information entropy, w denotes this word, B is the set of words located to the right of w in the corpus, C(w, a) denotes the number of times the words w and a occur together in the corpus, and n is a positive integer.
CN2013102791845A 2013-07-04 2013-07-04 Method and system for discovering new words in open fields Pending CN103294664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102791845A CN103294664A (en) 2013-07-04 2013-07-04 Method and system for discovering new words in open fields

Publications (1)

Publication Number Publication Date
CN103294664A true CN103294664A (en) 2013-09-11

Family

ID=49095558

Country Status (1)

Country Link
CN (1) CN103294664A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324883A (en) * 2008-07-31 2008-12-17 电子科技大学 Method for extracting variation key word
CN101477518A (en) * 2009-01-09 2009-07-08 昆明理工大学 Tour field named entity recognition method based on condition random field
CN102279890A (en) * 2011-09-02 2011-12-14 苏州大学 Sentiment word extracting and collecting method based on micro blog
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈飞,刘奕群,魏超,张云亮,张敏,马少平: "基于条件随机场方法的开放领域新词发现", 《软件学报》, vol. 24, no. 5, 15 May 2013 (2013-05-15) *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750665A (en) * 2013-12-30 2015-07-01 腾讯科技(深圳)有限公司 Text message processing method and text message processing device
CN106033462A (en) * 2015-03-19 2016-10-19 科大讯飞股份有限公司 Neologism discovering method and system
CN106033462B (en) * 2015-03-19 2019-11-15 科大讯飞股份有限公司 A kind of new word discovery method and system
CN106445908B (en) * 2015-08-07 2019-11-15 阿里巴巴集团控股有限公司 Text recognition method and device
CN106445908A (en) * 2015-08-07 2017-02-22 阿里巴巴集团控股有限公司 Text identification method and apparatus
CN105224682B (en) * 2015-10-27 2018-06-05 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN108897842B (en) * 2015-10-27 2021-04-09 上海智臻智能网络科技股份有限公司 Computer readable storage medium and computer system
CN108875040B (en) * 2015-10-27 2020-08-18 上海智臻智能网络科技股份有限公司 Dictionary updating method and computer-readable storage medium
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105183923A (en) * 2015-10-27 2015-12-23 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105389349B (en) * 2015-10-27 2018-07-27 上海智臻智能网络科技股份有限公司 Dictionary update method and device
CN108875040A (en) * 2015-10-27 2018-11-23 上海智臻智能网络科技股份有限公司 Dictionary update method and computer readable storage medium
CN108897842A (en) * 2015-10-27 2018-11-27 上海智臻智能网络科技股份有限公司 Computer readable storage medium and computer system
CN105488098A (en) * 2015-10-28 2016-04-13 北京理工大学 Field difference based new word extraction method
CN105488098B (en) * 2015-10-28 2019-02-05 北京理工大学 A kind of new words extraction method based on field otherness
CN106815189B (en) * 2015-11-27 2020-03-20 中科国力(镇江)智能技术有限公司 Method for identifying new Chinese verb
CN106815190B (en) * 2015-11-27 2020-06-23 阿里巴巴集团控股有限公司 Word recognition method and device and server
CN106815189A (en) * 2015-11-27 2017-06-09 镇江诺尼基智能技术有限公司 A kind of new verb identifying system of Chinese and method
CN106815190A (en) * 2015-11-27 2017-06-09 阿里巴巴集团控股有限公司 A kind of words recognition method, device and server
US11010554B2 (en) 2016-11-08 2021-05-18 Beijing Gridsum Technology Co., Ltd. Method and device for identifying specific text information
CN108062302B (en) * 2016-11-08 2019-03-26 北京国双科技有限公司 A kind of recognition methods of text information and device
CN108062302A (en) * 2016-11-08 2018-05-22 北京国双科技有限公司 A kind of recognition methods of particular text information and device
CN107038229B (en) * 2017-04-07 2020-07-17 云南大学 Use case extraction method based on natural semantic analysis
CN107038229A (en) * 2017-04-07 2017-08-11 云南大学 A kind of use-case extracting method based on natural semantic analysis
CN108984514A (en) * 2017-06-05 2018-12-11 中兴通讯股份有限公司 Acquisition methods and device, storage medium, the processor of word
CN108984513A (en) * 2017-06-05 2018-12-11 阿里巴巴集团控股有限公司 A kind of word string recognition methods and server
CN108984513B (en) * 2017-06-05 2022-03-04 阿里巴巴集团控股有限公司 Word string recognition method and server
CN107463682A (en) * 2017-08-08 2017-12-12 深圳市腾讯计算机***有限公司 A kind of recognition methods of keyword and device
CN110532539A (en) * 2018-05-24 2019-12-03 本识科技(深圳)有限公司 A kind of human-machine interactive information treating method and apparatus
CN110765239A (en) * 2019-10-29 2020-02-07 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
CN110765239B (en) * 2019-10-29 2023-03-28 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130911