CN103268339A - Recognition method and system of named entities in microblog messages

Info

Publication number: CN103268339A
Application number: CN201310182978XA
Authority: CN
Inventors: 程学旗; 伍大勇; 李静远; 王元卓; 刘倩
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2013-05-17
Filing date: 2013-05-17
Publication date: 2013-08-28
Anticipated expiration: 2033-05-17
Also published as: CN103268339B

Abstract

The invention provides a method for recognizing named entities in microblog messages. In the method, a small number of named entities are specified as seeds; a certain number of microblog messages from the original microblog message set to be processed are automatically labeled to form a training data set; the training data set is then used to train a named entity recognizer, and the trained recognizer is used to recognize named entities in microblog messages. Only a few existing seed entities need to be specified for a high-quality training set to be labeled automatically, which significantly reduces labor cost for microblog text, which is updated rapidly. High-quality labeled data is generated step by step in an iterative manner: in each iteration, the top N new named entities that best reflect how named entities occur in real microblog data are added to the seed bank, so that the final labeled data covers the whole microblog message set well.

Description

Method and system for recognizing named entities in microblog messages

Technical field

The present invention relates to network data processing and analysis, and in particular to a method for automatically recognizing named entities in microblog message text.

Background technology

Microblogging is a newly emerging form of publishing and spreading information on the Internet, and it has rapidly attracted Internet users because posting a message is convenient, brief and fast. There are currently hundreds of millions of microblog users in China, and on large microblog platforms such as Sina, Tencent, Sohu and NetEase, users produce a large volume of microblog message text every day; for example, Sina Weibo gains close to 100 million new messages per day. On a microblog platform every Internet user is a "self-media" outlet: users post messages to share what they see and hear and to express their opinions, needs and interests. The platform aggregates these messages into a massive message set, which in turn reflects the interest trends of the user population. Clearly, extracting the named entities that Internet users pay attention to, such as persons, places and organizations, from this massive body of microblog messages can provide important supporting information for upper-layer applications such as Internet marketing and group sentiment analysis. Named entity recognition in microblog message text has therefore become an important core technology in network data processing and analysis.

Named entity recognition identifies the expressions in a text that refer to entities by name, such as person names, place names and organization names, and supports a variety of upper-layer natural language analysis applications. Current research on named entity recognition usually targets well-formed text such as scientific literature and news reports. Microblog message text has its own characteristics: the language is casual, the grammar is non-standard and sentences are fragmentary, so existing named entity recognition methods cannot accurately identify the named entities that occur in it. Moreover, existing methods (which may be called supervised named entity recognition methods) need a manually annotated data set of a certain scale as training data for the recognition model; manually annotating a standard corpus is time-consuming and laborious, and annotating training data at large scale is difficult. In addition, because microblog messages keep accumulating and their content keeps changing, a manually annotated training set is not only costly to produce but also fails to reflect the characteristics of microblog data in a timely and accurate way, so recognition efficiency is low. At present there is no named entity recognition method designed specifically for microblog message text.

Summary of the invention

Therefore, the object of the present invention is to overcome the above defects of the prior art and to provide a named entity recognition method suited to microblog message text, which can recognize named entities in microblog messages efficiently while reducing the cost of manually labeling data.

The object of the present invention is achieved through the following technical solutions:

In one aspect, the invention provides a method for recognizing named entities in microblog messages, comprising:

Step 1: specifying a small number of named entities as seeds, and automatically labeling a certain number of microblog messages from the original microblog message set to be processed to form a training data set;

Step 2: training a named entity recognizer with the training data set;

Step 3: using the trained named entity recognizer to recognize named entities in microblog messages.

In the above method, step 1 may comprise:

Step 11): for each of the three named entity categories of person name, place name and organization name, taking a small number of named entity instances of that category as initial seed entities;

Step 12): adding the initial seed entities to a seed bank, and iteratively generating templates and new named entities of the same categories, which are stored in a template base and the seed bank respectively; wherein a template is the fixed-length preceding and following context of a named entity;

Step 13): pairing the named entities in the seed bank with the templates in the template base to form phrases, extracting from the original microblog message set the microblog message texts that contain such a phrase, locating the boundaries of the named entity by template matching, and labeling the category of the named entity as the category of the seed, thereby obtaining labeled microblog message texts; and taking all labeled microblog message texts so obtained as the training data set.

In the above method, iteratively generating templates and new named entities of the same categories in step 12), and storing them in the template base and the seed bank respectively, comprises:

Step (12-1): extracting all microblog sentences that contain a seed entity from the seed bank, to obtain a seed-entity sentence set;

Step (12-2): for each seed entity occurring in each sentence of the seed-entity sentence set, taking its fixed-length preceding and following context as a template and labeling the category of the template as the category of that seed entity; and taking all templates so obtained as a candidate template set;

Step (12-3): selecting the N best-quality templates from the candidate template set and adding them to the template base;

Step (12-4): using the templates in the template base to extract named entities from all unlabeled microblog data in the original microblog message set;

Step (12-5): selecting, from all extracted named entities, the M named entities with the highest confidence and adding them to the seed bank;

Step (12-6): repeating steps (12-1) to (12-5) until the seed bank no longer grows or a preset number of iterations is reached, wherein M and N are integers greater than 1.

In the above method, step (12-3) may comprise:

Extracting the features of each template in the candidate template set, the features comprising entity diversity, entity accuracy, template frequency and template intensity;

Ranking the templates on each feature, selecting the N templates with the best overall evaluation, and adding them to the template base;

Wherein entity diversity is the number of distinct named entities that the template extracts from the seed-entity sentence set.

Entity accuracy equals the number of named entities that the template extracts from the seed-entity sentence set and that are in the seed bank, divided by the total number of named entities that the template extracts from the seed-entity sentence set.

Template frequency equals the number of times the template occurs in the seed-entity sentence set divided by the total number of templates contained in the seed-entity sentence set.

The template intensity is calculated as follows:

Wherein the number of categories in which the template occurs is the number of distinct categories among the named entities that the template extracts from the seed-entity sentence set, and the total number of entity categories is the number of distinct categories among the named entities in the seed bank.

In the above method, the confidence of a named entity in step (12-5) equals the number of times the named entity and the template that extracted it occur together in the original microblog message set, divided by the product of the number of times the named entity occurs in that set and the number of times the template occurs in that set.

In the above method, step (12-2) may comprise:

For each seed entity occurring in each sentence of the seed-entity sentence set, taking the four characters before it and the four characters after it as its preceding and following context;

Matching the preceding and the following context separately against a lexicon of common Chinese words and taking the longest matching word as the template; when the word adjacent to the seed entity is a single-character word, extending the matching window until another word is matched; and when no common word can be matched, taking the four-character string as the template;

Labeling the category of the template as the category of the seed entity;

Taking all templates so obtained as the candidate template set.

In the above method, step 2 may comprise:

Step 2-1): treating each sentence of each microblog message text in the training data set as a labeling sequence, each labeling unit of which is one Chinese character, and using external linguistic-knowledge word lists to extract external linguistic-knowledge features for each labeling unit of the sequence;

Step 2-2): training the named entity recognizer with a conditional random field model, using the extracted external linguistic-knowledge features.

In the above method, in step 2-1), the external linguistic-knowledge word lists comprise: word list 1, organization-name suffixes; word list 2, a comprehensive list of place names; word list 3, place-name suffixes; word list 4, surnames; word list 5, common single-character given names; word list 6, first characters of common two-character given names; word list 7, second characters of common two-character given names; word list 8, titles and forms of address; word list 9, microblog celebrities (users with V verification on the microblog platform); word list 10, common words; word list 11, common single-character words.

In the above method, in step 2-1), the external linguistic-knowledge features extracted for each labeling unit comprise:

1) whether the current labeling unit (a single Chinese character) appears in any of word lists 1 to 11;

2) whether the current labeling unit together with the one preceding character appears in word list 1, 2, 3, 8, 9 or 10;

3) whether the current labeling unit together with the two preceding characters appears in word list 1, 2, 3, 8, 9 or 10;

4) whether the current labeling unit together with the three preceding characters appears in word list 1, 2, 8, 9 or 10;

5) whether the current labeling unit together with the four preceding characters appears in word list 1, 2 or 10;

6) whether the current labeling unit appears in word list 5 and the preceding character appears in word list 4;

7) whether the current labeling unit appears in word list 6 and the preceding character appears in word list 4;

8) whether the current labeling unit appears in word list 7 and the preceding character appears in word list 6;

9) whether the current labeling unit appears in word list 7 and the two preceding characters appear in word list 6 and word list 4 respectively;

10) whether the word formed by the current labeling unit and the two preceding characters, or by the current labeling unit and the one preceding character, is contained in word list 9.
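The ten features above can be sketched as a single function over a sequence of labeling units. This is only an illustrative sketch: the function name, the feature keys `f1` to `f10` and the representation of the word lists as a dictionary from list number to a set of strings are assumptions, not part of the described method. For feature 9 the sketch assumes that the immediately preceding character is checked against word list 6 and the one before it against word list 4.

```python
def lexicon_features(units, i, lex):
    """Boolean word-list features for units[i].

    units: list of labeling units (single characters in this sketch).
    lex:   dict mapping word-list number (1..11) to a set of strings
           (hypothetical representation of the external word lists).
    """
    def ngram(k):
        # Current unit joined with its k preceding units, or None near the start.
        return "".join(units[i - k:i + 1]) if i - k >= 0 else None

    def in_any(s, ids):
        return s is not None and any(s in lex.get(j, ()) for j in ids)

    cur = units[i]
    prev1 = units[i - 1] if i >= 1 else None
    prev2 = units[i - 2] if i >= 2 else None
    return {
        "f1": in_any(cur, range(1, 12)),
        "f2": in_any(ngram(1), (1, 2, 3, 8, 9, 10)),
        "f3": in_any(ngram(2), (1, 2, 3, 8, 9, 10)),
        "f4": in_any(ngram(3), (1, 2, 8, 9, 10)),
        "f5": in_any(ngram(4), (1, 2, 10)),
        "f6": in_any(cur, (5,)) and in_any(prev1, (4,)),
        "f7": in_any(cur, (6,)) and in_any(prev1, (4,)),
        "f8": in_any(cur, (7,)) and in_any(prev1, (6,)),
        # Assumed order: nearest preceding character in list 6, the one before it in list 4.
        "f9": in_any(cur, (7,)) and in_any(prev1, (6,)) and in_any(prev2, (4,)),
        "f10": in_any(ngram(2), (9,)) or in_any(ngram(1), (9,)),
    }
```

With toy word lists such as `{4: {"刘"}, 5: {"欢"}, 9: {"刘欢"}}`, the unit "欢" in "听刘欢的歌" would fire features 1, 2, 6 and 10.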

In another aspect, the invention provides a system for recognizing named entities in microblog messages, comprising:

An automatic labeling device, configured to automatically label, based on a small number of named entities specified as seeds, a certain number of microblog messages from the original microblog message set to be processed to form a training data set;

A training device, configured to train a named entity recognizer with the training data set;

A recognition device, which uses the trained named entity recognizer to recognize named entities in microblog messages.

Wherein the training device is further configured to:

Treat each sentence of each microblog message text in the training data set as a labeling sequence, each labeling unit of which is one Chinese character, and use external linguistic-knowledge word lists to extract external linguistic-knowledge features for each labeling unit of the sequence; and

Train the named entity recognizer with a conditional random field model, using the extracted external linguistic-knowledge features.

Compared with the prior art, the invention has the following advantages:

1. Only a small number of existing seed entities need to be specified for a high-quality training set to be labeled automatically. For text that is updated as quickly as microblog messages, this significantly reduces labor cost.

2. High-quality labeled data is produced step by step through iteration: each iteration adds to the seed bank the top N new named entities that best reflect how named entities occur in real microblog data, so the final labeled data covers the whole microblog data set well.

3. Conventional recognition methods for well-formed documents are mostly built on top of Chinese word segmentation, but microblog text uses non-standard wording and contains many abbreviations, ungrammatical expressions, ambiguous words and new words. The recognition method of the invention works at the character level, which avoids the accumulation of errors that word segmentation mistakes would introduce into named entity recognition.

Description of drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of the method for recognizing named entities in microblog messages according to an embodiment of the invention;

Fig. 2 is a schematic diagram of the process of automatically labeling microblog message text with seed named entities to produce training data according to an embodiment of the invention;

Fig. 3 is a schematic diagram of the process of training the named entity recognizer according to an embodiment of the invention.

Embodiment

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here only serve to explain the invention and are not intended to limit it.

Fig. 1 shows a weakly supervised method for recognizing named entities in microblog messages according to an embodiment of the invention. The method comprises: step 1), specifying a small number of named entities as seeds and automatically labeling a certain number of microblog messages in the original microblog message set (or original microblog message database) to be processed, to serve as the training data set for the named entity recognizer; step 2), training the named entity recognizer on this training data set; step 3), using the trained named entity recognizer to recognize named entities in microblog messages. The original microblog message set or database stores preprocessed microblog message text; preprocessing of the collected microblog data may include extracting the body text of each microblog, filtering out HTML tags and non-punctuation special symbols, and unifying punctuation through full-width/half-width conversion.
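The preprocessing mentioned above can be illustrated with a minimal sketch. The text does not say which special symbols are filtered or in which direction widths are unified, so the symbol set below and the choice to convert full-width forms to half-width are assumptions, as are the function names.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Assumed set of non-punctuation special symbols to drop.
SPECIAL_RE = re.compile(r"[★☆◆■●▲→~^*|\\]")


def to_halfwidth(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(raw):
    """Strip HTML tags, unify character width, drop special symbols, tidy whitespace."""
    text = TAG_RE.sub("", raw)
    text = html.unescape(text)
    text = to_halfwidth(text)
    text = SPECIAL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Under these assumptions, `preprocess('<a href="x">林丹</a>，加油！★')` yields the body text with the tags and the star removed and the punctuation converted to half-width.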

In step 1, automatically labeling microblog data includes marking the boundaries of named entities (also simply called entities) in the microblog message text and the category of each named entity. Fig. 2 is a schematic diagram of the process of automatically labeling microblog message text with seed named entities to produce training data. As shown in Fig. 2, step 1 comprises:

Step 11): specify a small number of named entities as seeds. For example, for each of the three named entity categories of person name, place name and organization name, a small number of named entity instances of that category are taken as seeds. When choosing seeds, named entities that occur frequently in the microblog data may be preferred, so that the subsequent steps can more easily obtain sentences containing them, which improves labeling efficiency. Table 1 lists some examples of seed entities for each category; they were chosen for one month of data collected from Sina Weibo, from August 1, 2012 to August 31, 2012. For each of the three categories, Table 1 gives 10 named entities of that category as seeds. A named entity used as a seed may also be called a seed entity or a seed named entity.

Table 1

Step 12): starting from the initial seed named entities, iteratively generate high-quality templates and high-confidence new named entities of the same categories, and store them in the template base and the seed bank respectively. A template is the fixed-length preceding and following context of a named entity; for example, for the seed named entity "Lin Dan" in the sentence "top-seeded Chinese player Lin Dan faces Japanese player Sho Sasaki in a major sporting event", the template may be "player # faces".

Step 13): pair the named entities in the seed bank with the templates of the same category in the template base to form phrases, extract from the original microblog message database the microblog message texts that contain such a phrase, locate the boundaries of the named entity by template matching, and label the category of the named entity as the category of the seed. This yields labeled microblog message text. Finally, all labeled microblog message texts so obtained serve as the training data set for the named entity recognizer.

More specifically, according to one embodiment of the invention, iteratively generating high-quality templates and high-confidence new named entities of the same categories from the initial seed named entities in step 12), and storing them in the template base and the seed bank respectively, comprises the following steps:

Step 1-2-1): set up a template base P and a seed bank S. P is initialized to empty, and S is initialized with the person-name, place-name and organization-name instances specified as seeds, such as the seed named entities shown in Table 1.

Step 1-2-2): extract all microblog sentences that contain a seed entity from the seed bank; the result is denoted the seed-entity sentence set Ds. Ds is used to generate templates and to assess their quality.

Step 1-2-3): for each seed entity occurring in each sentence of Ds, take a fixed-length context before and after it (regular punctuation included), for example the four characters before and the four characters after. Match the preceding and the following context separately against a lexicon of common Chinese words (words that occur often in ordinary documents, such as "word" and "rural area") and take the longest matching word as the template. When the word adjacent to the entity is a single-character word (for example, function words such as "and" and "as well as"), extend the matching window until another word is matched. When no common word can be matched, take the four-character string as the template. Label the category of the template as the category of the seed; the resulting templates form the candidate template set Pc. For example, taking "Zhao Benshan" in Table 2, microblog sentences such as "Zhao Benshan stars in the TV series Rural Love Story", "Zhao Benshan attends the Jiangsu Satellite TV Spring Festival Gala" and "Watched Zhao Benshan's Spring Festival Gala sketch, felt great" are extracted, from which templates such as "Start # stars in", "Start # attends" and "watched #'s Spring Festival Gala" can be obtained, where "Start" denotes the start of a sentence.
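The context-to-template procedure of step 1-2-3 can be sketched as follows. This is one possible reading, not a definitive implementation: the function names, the "End" marker for a sentence-final entity, the use of only the first occurrence of the entity, and the way the window is extended past a single-character word are all assumptions. The Chinese strings in the usage note are illustrative back-translations of the examples in the text.

```python
def longest_match(ctx, lexicon, side):
    """Longest lexicon word touching the entity: a suffix of the left context
    or a prefix of the right context. Returns "" when nothing matches."""
    for n in range(len(ctx), 0, -1):
        cand = ctx[-n:] if side == "left" else ctx[:n]
        if cand in lexicon:
            return cand
    return ""


def context_part(sentence, start, end, lexicon, side, window=4):
    """One side of a template for the entity at sentence[start:end]."""
    if side == "left":
        if start == 0:
            return "Start"          # sentence-initial marker from the text
        ctx = sentence[max(0, start - window):start]
    else:
        if end == len(sentence):
            return "End"            # assumed marker, not named in the text
        ctx = sentence[end:end + window]
    word = longest_match(ctx, lexicon, side)
    if not word:
        return ctx                  # no common word: fall back to the raw window
    if len(word) == 1:
        # Single-character neighbour: extend until another word is matched.
        if side == "left":
            edge = start - 1
            more = longest_match(sentence[max(0, edge - window):edge], lexicon, "left")
            return more + word
        edge = end + 1
        return word + longest_match(sentence[edge:edge + window], lexicon, "right")
    return word


def extract_template(sentence, entity, lexicon, window=4):
    """Return a template such as "选手#对阵", or None if the entity is absent."""
    start = sentence.find(entity)
    if start == -1:
        return None
    end = start + len(entity)
    left = context_part(sentence, start, end, lexicon, "left", window)
    right = context_part(sentence, start, end, lexicon, "right", window)
    return f"{left}#{right}"
```

For instance, with a toy lexicon containing "看了", "的" and "春晚", the sentence "看了赵本山的春晚小品" and the seed "赵本山" give the template "看了#的春晚": the single-character word "的" triggers the extension to the next word.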

Step 1-2-4): apply all templates in the candidate template set Pc back to the seed-entity sentence set Ds to extract named entities, and denote all extracted strings as the template-extracted string set St. St contains three kinds of strings: seed entities, new entities of the same category, and noise strings. Because the second kind (new entities) is small in number, the quality of each template can be analyzed from what it extracts into St in the following steps.

For example, when templates such as "Start # stars in", "Start # attends" and "watched #'s Spring Festival Gala" are applied to Ds again, the extracted string set St may contain "Zhao Benshan", which is of the first kind; it may also contain "Huang Hong", which is of the second kind; and it may contain the noise string "tomorrow", extracted by the template "Start # attends" from a sentence such as "Tomorrow attending ...", which is of the third kind. It is therefore necessary to analyze which templates extract with better quality.

Step 1-2-5): extract the features of the templates in the candidate template set Pc and analyze template quality. The features used for quality analysis are entity diversity, entity accuracy, template frequency and template intensity; the templates are ranked on each feature, and the N templates with the best overall evaluation are added to the high-quality template base.

The template features are calculated as follows:

1) entity diversity = number of distinct entities

That is, for a template in the candidate template set, entity diversity is the number of distinct entities that the template extracts from the seed-entity sentence set Ds. For example, suppose the template "listen to #'s songs" extracts microblogs such as "Listening to The Voice of China, it still feels very, very different from listening to Na Ying's songs", "Listening to Huang Qishan's songs, simply overwhelming" and "Listening to Huang Qishan's songs feels really great, she hits the high notes with ease". The distinct entities extracted are Na Ying and Huang Qishan, so the entity diversity is 2.

2) entity accuracy is calculated as follows:

entity accuracy = (number of entities the template extracts from Ds that are in the seed bank) / (total number of entities the template extracts from Ds)

That is, for each template, count the total number of entities it extracts from Ds. Taking the template "watched #'s Spring Festival Gala" as an example, suppose it extracts microblogs such as "Watched Zhao Benshan's Spring Festival Gala sketch", "Did anyone watch the Jiangsu Satellite TV Spring Festival Gala" and "Watched the Lychee Channel's Spring Festival Gala, laughed and cried". The template extracts 3 entities in total (Zhao Benshan, Jiangsu Satellite TV and Lychee Channel), of which 1 (Zhao Benshan) appears in the seed set (i.e., the seed bank), so the entity accuracy is 1/3.

3) template frequency is calculated as follows:

template frequency = (number of times the template occurs in Ds) / (total number of templates contained in Ds)

4) template intensity is calculated as follows

For a given template, the number of categories in which it occurs is the number of distinct types among the named entities it extracts. For example, again taking the template "watched #'s Spring Festival Gala" and the three microblogs above, Zhao Benshan is a person name while Jiangsu Satellite TV and Lychee Channel are organization names, so the number of categories for the template is 2. The total number of entity categories is the number of distinct categories among the named entities in the seed bank.

The above features are combined into an overall evaluation of each template, and the N templates with the best overall quality are selected. For example, the values of all features of a template may be normalized and multiplied together to give the template's overall evaluation index. Alternatively, a weighted combination of the features may be used as the index, with weights set according to the actual data environment or system requirements. The above features are only illustrative; other embodiments may use any combination of them.
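As an illustration of the normalized-product evaluation, the sketch below scores templates on entity diversity, entity accuracy and template frequency. Template intensity is left out because its formula is not reproduced in this text; the text allows any combination of the features. The function names, the input layout (a dictionary from template to the list of strings it extracted from Ds, one per match), the use of the match count as the template's occurrence count, and normalization by the per-feature maximum are all assumptions.

```python
def template_features(extractions, seeds):
    """extractions: dict template -> list of strings extracted from Ds (one per match).
    seeds: set of seed entities. Returns dict template -> feature dict."""
    total = sum(len(v) for v in extractions.values())
    feats = {}
    for tpl, ents in extractions.items():
        distinct = set(ents)
        feats[tpl] = {
            "diversity": len(distinct),
            "accuracy": len(distinct & seeds) / len(distinct) if distinct else 0.0,
            # Assumption: each match counts as one occurrence of the template in Ds.
            "frequency": len(ents) / total if total else 0.0,
        }
    return feats


def top_templates(feats, n):
    """Rank templates by the product of max-normalized features; keep the best n."""
    names = ["diversity", "accuracy", "frequency"]
    maxima = {k: max(f[k] for f in feats.values()) or 1.0 for k in names}

    def score(tpl):
        s = 1.0
        for k in names:
            s *= feats[tpl][k] / maxima[k]
        return s

    return sorted(feats, key=score, reverse=True)[:n]
```

With the example from the text, a template that extracts Zhao Benshan, Jiangsu Satellite TV and Lychee Channel while only Zhao Benshan is a seed gets diversity 3 and accuracy 1/3. A weighted sum could replace the product in `score` if the weighted-combination variant is preferred.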

Step 1-2-6): use the high-quality templates in the template base to extract named entities from all unlabeled microblog message data in the original microblog database; the set of all extracted results is denoted the candidate entity set Ec.

For example, taking the template "player # faces", plain string matching over the whole original microblog database can match microblogs such as "Chinese player Ding Junhui faces English player Selby" and "[World Snooker Masters] Chinese player Zhou Yuelong faces White", so the named entities extracted are "Ding Junhui" and "Zhou Yuelong".
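Extraction by string matching can be sketched with a regular expression built from the template. The function names and the cap on entity length (`max_len`) are assumptions, and the Chinese strings in the usage note are illustrative back-translations of the example above.

```python
import re


def template_to_regex(template, max_len=8):
    """Turn a template like "选手#对阵" into a regex capturing the '#' slot.

    max_len caps the captured entity length (an assumed limit)."""
    left, right = template.split("#", 1)
    left_re = "^" if left == "Start" else re.escape(left)
    return re.compile(f"{left_re}(.{{1,{max_len}}}?){re.escape(right)}")


def extract_entities(template, messages):
    """Return every string captured by the template across the messages."""
    pattern = template_to_regex(template)
    found = []
    for msg in messages:
        found.extend(m.group(1) for m in pattern.finditer(msg))
    return found
```

Applied to "中国选手丁俊晖对阵英格兰选手塞尔比" and "【世界斯诺克大师赛】中国选手周跃龙对阵怀特", the template "选手#对阵" captures "丁俊晖" and "周跃龙". The lazy quantifier keeps the captured span as short as possible, which matters when the right context appears more than once in a message.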

Step 1-2-7): compute the confidence of each candidate named entity from its co-occurrence with the high-quality templates in the original microblog message database, and add the M entities with the highest confidence to the seed bank. This step keeps noise well under control. The confidence is calculated as follows:

confidence = (number of times the entity and the template that extracted it occur together) / ((number of times the entity occurs) × (number of times the template occurs))

That is, the confidence of a candidate named entity equals the number of times the named entity and the template that extracted it occur together in the original microblog message set, divided by the product of the number of times the named entity occurs in that set and the number of times the template occurs in that set.
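The confidence computation can be sketched directly from this definition. The function names are assumptions, occurrences are counted by plain substring and regex matching, the "Start" marker is not handled, and `max_len` is an assumed cap on the slot length when counting template occurrences.

```python
import re


def count_template(template, messages, max_len=8):
    """Number of places where the template's left and right context enclose a slot."""
    left, right = template.split("#", 1)
    pattern = re.compile(re.escape(left) + ".{1," + str(max_len) + "}?" + re.escape(right))
    return sum(len(pattern.findall(m)) for m in messages)


def confidence(entity, template, messages):
    """Co-occurrence count divided by (entity count * template count)."""
    left, right = template.split("#", 1)
    co = sum(m.count(left + entity + right) for m in messages)
    n_entity = sum(m.count(entity) for m in messages)
    n_template = count_template(template, messages)
    if n_entity == 0 or n_template == 0:
        return 0.0
    return co / (n_entity * n_template)


def top_m(candidates, messages, m):
    """candidates: list of (entity, template) pairs; keep the m most confident entities."""
    ranked = sorted(candidates, key=lambda c: confidence(c[0], c[1], messages), reverse=True)
    return [entity for entity, _ in ranked[:m]]
```

An entity that appears often outside the template's context gets a lower score than one that appears almost only inside it, which is how this measure penalizes strings picked up by chance.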

Step 1-2-8): repeat steps 1-2-2) to 1-2-7) until the seed bank no longer grows or a preset number of iterations is reached. In the above steps, M and N are integers greater than 1 and may be set according to user or system requirements.
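The overall iteration of steps 1-2-1 to 1-2-8 can be sketched as a loop with pluggable stages. This is only a structural sketch under simplifying assumptions: entity categories are ignored, and the four stage functions (`gen_templates`, `select_templates`, `extract`, `select_entities`) are hypothetical callables supplied by the caller rather than anything defined in the text.

```python
def bootstrap(seeds, corpus, gen_templates, select_templates,
              extract, select_entities, n, m, max_iter=10):
    """Grow a seed bank and a template base iteratively.

    gen_templates(ds, seed_bank)                 -> candidate templates (Pc)
    select_templates(cands, ds, seed_bank, n)    -> best n templates
    extract(template_base, corpus)               -> candidate entities (Ec)
    select_entities(cands, corpus, m)            -> m most confident entities
    """
    seed_bank = set(seeds)       # S, initialized with the specified seeds
    template_base = set()        # P, initialized to empty
    for _ in range(max_iter):
        # Ds: sentences that contain at least one seed entity.
        ds = [s for s in corpus if any(e in s for e in seed_bank)]
        candidates = gen_templates(ds, seed_bank)
        template_base |= set(select_templates(candidates, ds, seed_bank, n))
        extracted = extract(template_base, corpus)
        new_entities = set(select_entities(extracted, corpus, m)) - seed_bank
        if not new_entities:     # the seed bank no longer grows
            break
        seed_bank |= new_entities
    return seed_bank, template_base
```

Keeping the stages as parameters makes the stopping rule explicit: the loop ends either when an iteration adds nothing new to the seed bank or when `max_iter` is reached.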

Step 13): for every (seed, template) pair formed from the expanded seed bank and the template base, combine the two into a phrase, extract from the original microblog message database the microblog message texts that contain the phrase, locate the boundaries of the named entity by template matching, and label the category of the named entity as the category of the seed, finally generating training data that matches the characteristics of microblog data. This training data consists of microblog message texts that contain both a high-quality template and a high-quality seed entity, and serves as the labeled corpus for training the named entity recognizer.

For example, the person-name entity "Liu Huan" in the seed bank and the person-name template "listen to #'s songs" in the template base are combined into the phrase "listen to Liu Huan's songs". The original microblog database yields the microblog "... after growing up, listening to Liu Huan's songs and watching the TV series Water Margin again ...". The template "listen to #'s songs" locates the boundaries of the person-name entity "Liu Huan" in this microblog, and the entity "Liu Huan" is labeled with the category person name.
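The automatic labeling of step 13 can be sketched as follows. The text does not specify a tag scheme, so the character-level B-/I-/O tags and the category code "PER" are assumptions, as are the function names; the "Start" marker is not handled here.

```python
def label_message(message, entity, template, category):
    """Tag each character of a message; the entity found inside the
    seed+template phrase gets B-<category>/I-<category>, the rest "O"."""
    left, right = template.split("#", 1)
    phrase = left + entity + right
    tags = ["O"] * len(message)
    pos = message.find(phrase)
    while pos != -1:
        start = pos + len(left)            # entity boundary given by the template
        tags[start] = f"B-{category}"
        for i in range(start + 1, start + len(entity)):
            tags[i] = f"I-{category}"
        pos = message.find(phrase, pos + len(phrase))
    return list(zip(message, tags))


def build_training_set(triples, corpus):
    """triples: iterable of (entity, template, category) with matching categories.
    Returns one labeled sequence per message that contains the combined phrase."""
    data = []
    for entity, template, category in triples:
        left, right = template.split("#", 1)
        phrase = left + entity + right
        for msg in corpus:
            if phrase in msg:
                data.append(label_message(msg, entity, template, category))
    return data
```

For the example above, the seed "刘欢" and the template "听#的歌" form the phrase "听刘欢的歌"; in a message containing it, the two characters of "刘欢" receive the tags B-PER and I-PER.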

Step 2: the microblog message data automatically labeled in step 1 is used as the training data set to train a named entity recognizer suited to microblog messages, and the trained recognizer can then be used to recognize named entities in microblog messages. The named entity recognizer may be trained on the basis of a hidden Markov model, a maximum entropy model, a conditional random field model, or the like. The features used for training may be chosen according to the actual data environment and system requirements; however, the features chosen affect the recognition accuracy and efficiency of the trained recognizer.

In a preferred embodiment of the invention, to better improve the recognition accuracy and efficiency of the named entity recognizer, step 2 uses the microblog message data automatically labeled in step 1 as the training set, extracts multiple kinds of linguistic-knowledge features from the training data, and trains a conditional random field (CRF) model, thereby obtaining a named entity recognizer suited to microblog messages. The linguistic-knowledge features of the training data may be extracted from multiple kinds of external linguistic knowledge according to designed feature-extraction combinations.

Fig. 3 is a schematic diagram of the process of training the named entity recognizer according to an embodiment of the invention. The external linguistic knowledge may take the form of external linguistic-knowledge word lists. Step 2 may mainly comprise:

Step 2-1): treat each sentence of each microblog message text in the training set as a labeling sequence, each labeling unit of which is one Chinese character, and use the external linguistic-knowledge word lists to extract features for each labeling unit of the sequence. A whole run of digits or of foreign-language characters is treated as a single labeling unit. A sentence here means a string delimited by a comma, period, exclamation mark or semicolon; several such symbols in a row are treated as a single sentence separator.
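The segmentation into sentences and labeling units described in step 2-1 can be sketched with two regular expressions. The function names are assumptions; the sketch also assumes that foreign-language strings are runs of Latin letters and that both the Chinese and the ASCII forms of the four separators count. A period inside a decimal number would also split a sentence in this simplified version.

```python
import re

# Comma, period, exclamation mark, semicolon (ASCII and Chinese forms);
# a run of several separators acts as one.
SEP_RE = re.compile(r"[,，.。!！;；]+")
# A run of digits or of Latin letters is one unit; any other non-space character is one unit.
UNIT_RE = re.compile(r"[0-9]+|[A-Za-z]+|\S")


def split_sentences(text):
    """Split a message into sentences on the separators, dropping empty pieces."""
    return [s for s in SEP_RE.split(text) if s.strip()]


def labeling_units(sentence):
    """Split a sentence into labeling units."""
    return UNIT_RE.findall(sentence)
```

For example, the units of "今天看了iPhone5发布会" are the individual Chinese characters plus "iPhone" and "5" as whole units.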

The external linguistic knowledge vocabularies used can comprise: vocabulary 1, organization-name suffix vocabulary; vocabulary 2, place-name master list; vocabulary 3, place-name suffix table; vocabulary 4, surname table; vocabulary 5, table of common single-character given names; vocabulary 6, table of first characters of common two-character given names; vocabulary 7, table of second characters of common two-character given names; vocabulary 8, appellation list; vocabulary 9, microblog celebrity table (persons with "V" verification on the microblog platform); vocabulary 10, common word table; vocabulary 11, common single-character vocabulary.

For each labeling unit, the external linguistic knowledge features to be extracted comprise the following. Each feature takes the value "true" (or "1") if the condition holds, and "false" (or "0") otherwise.

1) Whether the current labeling unit (each Chinese character) appears in any of vocabulary 1 to vocabulary 11.

2) Whether the current labeling unit together with the one preceding character appears in vocabulary 1, vocabulary 2, vocabulary 3, vocabulary 8, vocabulary 9 or vocabulary 10.

3) Whether the current labeling unit together with the two preceding characters appears in vocabulary 1, vocabulary 2, vocabulary 3, vocabulary 8, vocabulary 9 or vocabulary 10.

4) Whether the current labeling unit together with the three preceding characters appears in vocabulary 1, vocabulary 2, vocabulary 8, vocabulary 9 or vocabulary 10.

5) Whether the current labeling unit together with the four preceding characters appears in vocabulary 1, vocabulary 2 or vocabulary 10.

6) Whether the current labeling unit appears in vocabulary 5 and the one preceding character appears in vocabulary 4.

7) Whether the current labeling unit appears in vocabulary 6 and the one preceding character appears in vocabulary 4.

8) Whether the current labeling unit appears in vocabulary 7 and the one preceding character appears in vocabulary 6.

9) Whether the current labeling unit appears in vocabulary 7 and the two preceding characters appear in vocabulary 6 and vocabulary 4 respectively.

10) Whether the word formed by the current labeling unit and the two preceding characters, or by the current labeling unit and the one preceding character, is contained in vocabulary 9 (longest match is used).

In this way, a feature vector of 10 elements is obtained for each labeling unit. Take the microblog message text "According to statistics from the Haidian tax bureau ..." as an example. For the character "局" ("bureau") in it, the word "税务局" ("tax bureau") formed by the current labeling unit and the two preceding characters is in vocabulary 1, so feature 3) of the labeling unit "局" is true; the character "局" also appears in one of vocabulary 1 to vocabulary 11, so feature 1) of the labeling unit "局" is true. Proceeding in this way, the feature vector of the labeling unit "局" is (1,1,1,0,0,0,0,0,0,0). Likewise, for each microblog message text, a feature matrix is obtained that consists of the feature vectors of all the characters it contains.
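The ten features can be sketched in Python as follows. This is an illustrative reading, not the patent's implementation: the function `extract_features`, the representation of the vocabularies as a dict from vocabulary number to a set of entries, and the toy entries in the usage example are all assumptions. Membership is taken to mean exact match against a vocabulary entry, and for feature 9) the immediately preceding character is checked against vocabulary 6 and the one before it against vocabulary 4.

```python
def _join(units, i, n_prev):
    """Current unit joined with its n_prev predecessors; None near the start."""
    if i - n_prev < 0:
        return None
    return "".join(units[i - n_prev:i + 1])


def _in_any(word, vocab, table_ids):
    """True if word is an entry of at least one of the given vocabularies."""
    return word is not None and any(word in vocab.get(t, ()) for t in table_ids)


def extract_features(units, i, vocab):
    """Return the 10-element 0/1 feature vector of labeling unit units[i].

    vocab maps a vocabulary number (1..11) to a set of entries (assumed layout).
    """
    cur = units[i]
    prev1 = units[i - 1] if i >= 1 else None
    prev2 = units[i - 2] if i >= 2 else None

    def has(word, table_id):
        return _in_any(word, vocab, (table_id,))

    return [
        int(_in_any(cur, vocab, range(1, 12))),                         # 1)
        int(_in_any(_join(units, i, 1), vocab, (1, 2, 3, 8, 9, 10))),   # 2)
        int(_in_any(_join(units, i, 2), vocab, (1, 2, 3, 8, 9, 10))),   # 3)
        int(_in_any(_join(units, i, 3), vocab, (1, 2, 8, 9, 10))),      # 4)
        int(_in_any(_join(units, i, 4), vocab, (1, 2, 10))),            # 5)
        int(has(cur, 5) and has(prev1, 4)),                             # 6)
        int(has(cur, 6) and has(prev1, 4)),                             # 7)
        int(has(cur, 7) and has(prev1, 6)),                             # 8)
        int(has(cur, 7) and has(prev1, 6) and has(prev2, 4)),           # 9)
        int(has(_join(units, i, 2), 9) or has(_join(units, i, 1), 9)),  # 10)
    ]


if __name__ == "__main__":
    # Toy vocabularies, invented for illustration only.
    toy_vocab = {4: {"赵"}, 6: {"本"}, 7: {"山"}, 9: {"赵本山"}}
    print(extract_features(list("赵本山"), 2, toy_vocab))
    # -> [1, 0, 1, 0, 0, 0, 0, 1, 1, 1]
```

Stacking the vectors of all labeling units in a sentence gives the feature matrix described above. Since feature 10) is a single yes/no value, checking the three-character word before the two-character word yields the same result as any other order; the longest-match rule matters only if the matched word itself is needed.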

Training the CRF-based named entity recognizer mainly proceeds as follows. First, in each labeled microblog message text, every character is given a label. Four classes of labels are used, B, I, E and O: B denotes the first character of an entity, I a character in the middle of an entity, E the last character of an entity, and O a non-entity character. For example, in the labeling "看/O 了/O 赵/Bp 本/Ip 山/Ep 的/O 春/O 晚/O 小/O 品/O" ("watched Zhao Benshan's Spring Festival Gala sketch"), Bp, Ip and Ep denote the beginning, middle and end of a person name, so the position of the entity in the sentence is known from the labels. Training the named entity recognition model amounts to estimating its parameters from microblog messages labeled in this way. Then, after features are extracted from the labeled data by the external linguistic knowledge feature extraction method described above, each labeled microblog message is represented as a combination of features, and the feature sets of all labeled microblog messages are used as the input of the CRF model. Training estimates the weight of each feature function by maximum likelihood. Once the weights of all feature functions have been obtained, the trained named entity recognizer has been obtained.
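The B/I/E/O labeling scheme can be illustrated with a short Python sketch that turns entity spans into per-character labels. The function `bieo_labels` and the span format (start index, exclusive end index, type suffix) are hypothetical. The text does not say how a one-character entity is labeled; the sketch labels it B, which is an assumption.

```python
def bieo_labels(units, entities):
    """Label each unit with B/I/E plus a type suffix, or O for non-entities.

    entities: non-overlapping (start, end, suffix) spans, end exclusive,
    e.g. (2, 5, "p") for a person name covering units 2, 3 and 4.
    """
    labels = ["O"] * len(units)
    for start, end, suffix in entities:
        for k in range(start, end):
            if k == start:
                tag = "B"  # first character (also used for a 1-character entity)
            elif k == end - 1:
                tag = "E"  # last character
            else:
                tag = "I"  # middle character
            labels[k] = tag + suffix
    return labels


if __name__ == "__main__":
    units = list("看了赵本山的春晚小品")
    print(list(zip(units, bieo_labels(units, [(2, 5, "p")]))))
```

Labeled sequences of this form, paired with the feature vectors of their units, are what a CRF toolkit would take as training input.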

Then, for a microblog message to be recognized, its external linguistic knowledge features are extracted using the external linguistic knowledge vocabularies as described above, and the trained named entity recognizer uses the Viterbi algorithm to compute the label sequence with the maximum probability, thereby recognizing the named entities in the microblog message.
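The Viterbi decoding step can be sketched in plain Python. The sketch assumes the trained model has already been reduced to additive (log-domain) scores: `emission[t][y]` for giving label `y` to position `t`, and `transition[p][y]` for moving from label `p` to label `y`. These names and the toy scores in the usage example are assumptions; in a real CRF they would be computed from the learned feature weights.

```python
def viterbi(emission, transition, labels):
    """Return the label sequence with the highest total score.

    emission[t][y]   : score of label index y at position t
    transition[p][y] : score of moving from label index p to label index y
    """
    n = len(emission)
    if n == 0:
        return []
    k = len(labels)
    score = [list(emission[0])]  # best score of any path ending in each label
    back = []                    # back-pointers for positions 1..n-1
    for t in range(1, n):
        row, ptr = [], []
        for y in range(k):
            best_prev = max(range(k), key=lambda p: score[-1][p] + transition[p][y])
            row.append(score[-1][best_prev] + transition[best_prev][y] + emission[t][y])
            ptr.append(best_prev)
        score.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    y = max(range(k), key=lambda p: score[-1][p])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return [labels[i] for i in path]


if __name__ == "__main__":
    labels = ["O", "B", "E"]
    bad = -10.0  # toy penalty for transitions that break the labeling scheme
    transition = [[0.0, 0.0, bad],   # from O
                  [bad, bad, 0.0],   # from B
                  [0.0, 0.0, bad]]   # from E
    emission = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.4], [1.0, 0.0, 0.0]]
    print(viterbi(emission, transition, labels))  # -> ['B', 'E', 'O']
```

In the toy example, picking the best label at each position independently would give B, O, O, which breaks the scheme; decoding over the whole sequence avoids this.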

In yet another embodiment of the invention, a system for recognizing named entities in microblog messages is also provided. The system comprises an automatic labeling device, a training device and a recognition device. The automatic labeling device uses the method described above to automatically label, on the basis of a small number of named entities specified as seeds, a certain number of microblog messages from the original microblog message set to be processed as the training data set. The training device uses the method described above to train the named entity recognizer on the training data set. The recognition device uses the trained named entity recognizer to recognize the named entities in microblog messages.

To verify the performance of the method, the inventors also carried out experiments. Taking Sina Weibo as an example, the data for the month from August 1, 2012 to August 31, 2012 (about 50 million messages) was selected from all collected microblog messages as the target data set of the experiment. The weakly supervised method of the invention was used to perform named entity recognition on this data set. The named entity classes recognized are person names, place names and organization names. To evaluate the recognition performance of the invention, a portion of the microblog messages was selected at random and labeled manually as the test data set. Table 2 lists the details of the test data set.

Table 2

Total microblog messages	Person names	Place names	Organization names
15379	1654	1556	1055

Based on the method described above and the given experimental target data set, the experiment and test procedure is as follows:

1) The collected microblog messages are preprocessed: the body text is extracted, characters other than regular Chinese punctuation, such as special characters and HTML tags, are filtered out, and the result is stored in the original microblog database.

2) For each of the three named entity classes (person name, place name and organization name), 10 examples belonging to the class are given as seeds, namely the seed entities of each class listed in Table 1 above.

3) Using the automatic labeling step described above, a portion of the microblog data is selected automatically from the original microblog database and labeled automatically.

4) Using the 11 vocabularies described above, including the organization-name suffix vocabulary, place-name master list, place-name suffix table, surname table, table of common single-character given names, table of second characters of common two-character given names, table of first characters of common two-character given names, appellation list, common word table and common single-character vocabulary, the external linguistic knowledge features of the labeled microblog data are extracted.

5) The automatically labeled microblog data is used as the training set and, together with the extracted features, is used to train the CRF model, yielding a named entity recognizer suited to microblog messages.
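Step 1) of the procedure above, the preprocessing, can be sketched as follows. The function `clean_message` and the exact set of characters kept are assumptions: the text only says that HTML tags and special characters are filtered out and regular Chinese punctuation is retained, so the sketch keeps common CJK characters, ASCII letters and digits, and a small set of punctuation marks.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")  # HTML tags such as <a href="..."> or </a>
# Assumption: everything outside this whitelist counts as a "special character".
_DROP_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。！？；：、,.!?;:\s]")
_SPACE_RE = re.compile(r"\s+")


def clean_message(raw):
    """Strip HTML tags and special characters, then normalize whitespace."""
    text = _TAG_RE.sub("", raw)
    text = _DROP_RE.sub("", text)
    return _SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_message("<a href='x'>今天</a>★去北京！"))
```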

The recognition performance of the invention is evaluated by precision and recall, together with the F1 value, which takes both precision and recall into account. Table 3 lists the recognition performance of the named entity recognition method according to the embodiment of the invention on person names, place names and organization names.

(1) Precision equals the number of correctly recognized named entities divided by the number of all recognized named entities.

(2) Recall equals the number of correctly recognized named entities divided by the total number of named entities contained in the microblog messages.

Table 3

Class	Precision	Recall	F1 value
Person name	85.8%	81.3%	83.5%
Place name	87.2%	84.7%	85.9%
Organization name	80.1%	78.4%	79.2%
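The evaluation measures can be sketched in Python as follows. The text defines precision and recall but does not spell out the F1 formula; the sketch assumes the standard definition, the harmonic mean of precision and recall. The function names are hypothetical. Under that assumption, the F1 values in Table 3 follow from its precision and recall columns.

```python
def precision_recall(n_correct, n_recognized, n_gold):
    """Precision and recall from entity counts (0.0 when a denominator is 0)."""
    precision = n_correct / n_recognized if n_recognized else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


def f1_value(precision, recall):
    """Harmonic mean of precision and recall (assumed standard F1)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Precision and recall (in percent) as reported in Table 3.
    table3 = {
        "Person name": (85.8, 81.3),
        "Place name": (87.2, 84.7),
        "Organization name": (80.1, 78.4),
    }
    for name, (p, r) in table3.items():
        print(name, round(f1_value(p, r), 1))
    # -> 83.5, 85.9 and 79.2, matching the F1 column of Table 3
```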

As the specific embodiments above show, compared with traditional supervised named entity recognition methods, the microblog named entity recognition method provided by the invention saves the cost of manually labeling data: only a small number of existing seed entities need to be specified for a high-quality training set to be labeled automatically. For rapidly updated text such as microblog messages, this significantly reduces labor cost. Moreover, the invention produces high-quality labeled data step by step in an iterative manner, each time selecting the top M new named entities that best reflect how named entities occur in real microblog data and adding them to the seed bank, so that the labeled data finally produced covers the whole microblog data set well. In addition, traditional recognition methods mostly build on Chinese word segmentation, whereas microblog language is non-standard and contains many abbreviations, ungrammatical expressions, ambiguous words and new words. The method of the invention works at the character level when training the named entity recognizer, which avoids the accumulation of errors that word segmentation mistakes cause in named entity recognition.

Although the invention has been described by way of preferred embodiments, it is not limited to the embodiments described herein, and also covers various changes and variations made without departing from the scope of the invention.

Claims

1. A method for recognizing named entities in microblog messages, the method comprising:

Step 1: specifying a small number of named entities as seeds, and automatically labeling a certain number of microblog messages from the original microblog message set to be processed as a training data set;

2. The method according to claim 1, wherein step 1 comprises:

3. The method according to claim 2, wherein, in step 12), iteratively generating templates and new named entities of the same class and storing them in the template base and the seed bank respectively comprises:

4. The method according to claim 3, wherein step (12-3) comprises:

The template intensity is calculated as follows:

5. The method according to claim 3, wherein, in step (12-5), the confidence of a named entity equals the number of times the named entity and the template that extracted it occur together in the original microblog message set, divided by the product of the number of times the named entity occurs in the original microblog message set and the number of times the template occurs in the original microblog message set.

6. The method according to claim 3, wherein step (12-2) comprises:

taking all the obtained templates as the candidate template set.

7. The method according to any one of claims 1-6, wherein step 2 comprises

8. The method according to claim 7, wherein, in step 2-1), the external linguistic knowledge vocabularies comprise: vocabulary 1, organization-name suffix vocabulary; vocabulary 2, place-name master list; vocabulary 3, place-name suffix table; vocabulary 4, surname table; vocabulary 5, table of common single-character given names; vocabulary 6, table of first characters of common two-character given names; vocabulary 7, table of second characters of common two-character given names; vocabulary 8, appellation list; vocabulary 9, microblog celebrity table; vocabulary 10, common word table; vocabulary 11, common single-character vocabulary.

9. The method according to claim 8, wherein, in step 2-1), for each labeling unit, the external linguistic knowledge features to be extracted comprise:

a) whether the current labeling unit appears in any of vocabulary 1 to vocabulary 11;

b) whether the current labeling unit together with the one preceding character appears in vocabulary 1, vocabulary 2, vocabulary 3, vocabulary 8, vocabulary 9 or vocabulary 10;

c) whether the current labeling unit together with the two preceding characters appears in vocabulary 1, vocabulary 2, vocabulary 3, vocabulary 8, vocabulary 9 or vocabulary 10;

d) whether the current labeling unit together with the three preceding characters appears in vocabulary 1, vocabulary 2, vocabulary 8, vocabulary 9 or vocabulary 10;

e) whether the current labeling unit together with the four preceding characters appears in vocabulary 1, vocabulary 2 or vocabulary 10;

f) whether the current labeling unit appears in vocabulary 5 and the one preceding character appears in vocabulary 4;

g) whether the current labeling unit appears in vocabulary 6 and the one preceding character appears in vocabulary 4;

h) whether the current labeling unit appears in vocabulary 7 and the one preceding character appears in vocabulary 6;

i) whether the current labeling unit appears in vocabulary 7 and the two preceding characters appear in vocabulary 6 and vocabulary 4 respectively;

j) whether the word formed by the current labeling unit and the two preceding characters, or by the current labeling unit and the one preceding character, is contained in vocabulary 9.

11. The system according to claim 10, wherein the training device is further configured to: