CN103268339A - Recognition method and system of named entities in microblog messages - Google Patents

Publication number
CN103268339A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310182978XA
Other languages
Chinese (zh)
Other versions
CN103268339B (en)
Inventor
程学旗
伍大勇
李静远
王元卓
刘倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201310182978.XA
Publication of CN103268339A
Application granted
Publication of CN103268339B
Legal status: Active

Abstract

The invention provides a method for recognizing named entities in microblog messages. A small number of named entities are specified as seeds; a number of microblog messages from the original microblog message set to be processed are then labeled automatically to form a training data set; the training data set is used to train a named entity recognizer; and the trained recognizer is used to recognize named entities in microblog messages. Because only a few existing seed entities need to be specified for a high-quality training set to be labeled automatically, labor costs are significantly reduced for microblog messages, a kind of text that is updated rapidly. High-quality labeled data is generated step by step in an iterative manner: in each iteration, the top N new named entities that best reflect how named entities occur in real microblog data are added to the seed library, so that the labeled data finally generated covers the whole microblog message set well.

Description

Method and system for recognizing named entities in microblog messages
Technical field
The present invention relates to network data processing and analysis, and in particular to a method for automatically recognizing named entities in microblog message text.
Background
Microblogging is a newly emerging form of publishing and spreading information on the Internet. Because posting is convenient, brief and fast, it has rapidly attracted the attention of Internet users. Microblog users in China currently number in the hundreds of millions, and on large platforms such as Sina, Tencent, Sohu and NetEase, users produce a large amount of microblog message text every day; for example, close to 100 million new messages are posted on Sina Weibo each day. On a microblog platform every Internet user is a "self-media" outlet: users publish messages to share what they see and hear and to express their opinions, needs and interests. The platform aggregates these messages into a massive message set, which in turn reflects the interest trends of the user population. Analyzing the named entities that Internet users pay attention to in this mass of messages, such as people, places and organizations, can therefore provide important supporting information for upper-layer applications such as Internet marketing and group sentiment analysis. This makes named entity recognition in microblog message text an important core technology in network data processing and analysis.
Named entity recognition identifies the entities referred to by name in a text, for example person names, place names and organization names, and supports the upper-layer applications of various kinds of natural language analysis. Current research on named entity recognition usually targets well-formed text such as scientific literature and news reports. Microblog message text has its own characteristics: the language is casual, the grammar is non-standard and the sentences are fragmentary. Existing named entity recognition methods therefore cannot accurately identify the named entities that appear in microblog message text. Moreover, existing methods (which may be called supervised named entity recognition methods) need a manually labeled data set of some scale as training data for the recognition model. Manually labeling a standard corpus is time-consuming and laborious, and labeling training data at large scale is difficult. In addition, because microblog messages keep accumulating and their content keeps changing, a manually labeled training set is not only costly but also fails to reflect the characteristics of microblog data promptly and accurately, so recognition efficiency is low. At present there is no method for named entity recognition aimed specifically at microblog message text.
Summary of the invention
The object of the present invention is therefore to overcome the above defects of the prior art and to provide a named entity recognition method suited to microblog message text, which recognizes named entities in microblog messages efficiently and reduces the cost of manually labeling data.
The object of the invention is achieved through the following technical solutions:
In one aspect, the invention provides a method for recognizing named entities in microblog messages, comprising:
Step 1: specifying a small number of named entities as seeds, and automatically labeling a number of microblog messages from the original microblog message set to be processed to form a training data set;
Step 2: training a named entity recognizer with the training data set;
Step 3: recognizing named entities in microblog messages with the trained named entity recognizer.
In the above method, step 1 may comprise:
Step 11): for each of the three named entity categories (person names, place names and organization names), taking a small number of named entity instances belonging to that category as initial seed entities;
Step 12): adding the initial seed entities to a seed library, iteratively generating templates and new named entities of the same categories, and storing them in a template library and the seed library respectively; wherein a template is the fixed-length preceding and following context of a named entity;
Step 13): combining the named entities in the seed library and the templates in the template library pairwise into phrases, extracting from the original microblog message set the message texts that contain such a phrase, locating the boundary of the named entity by matching the template, and labeling the category of the named entity as the category of the seed, thereby obtaining labeled microblog message texts; and taking all labeled message texts obtained as the training data set.
In the above method, iteratively generating templates and new named entities of the same categories in step 12), and storing them in the template library and the seed library respectively, comprises:
Step (12-1): extracting all microblog sentences that contain a seed entity from the seed library, to obtain a seed-entity sentence set;
Step (12-2): for each seed entity occurring in each sentence of the seed-entity sentence set, taking its fixed-length preceding and following context as a template and labeling the category of the template as the category of that seed entity; and taking all templates obtained as a candidate template set;
Step (12-3): selecting the N best-quality templates from the candidate template set and adding them to the template library;
Step (12-4): using the templates in the template library to extract named entities from all unlabeled microblog data in the original microblog message set;
Step (12-5): selecting, from all extracted named entities, the M named entities with the highest confidence and adding them to the seed library;
Step (12-6): repeating steps (12-1) to (12-5) until the seed library no longer grows or a preset number of iterations is reached, wherein M and N are integers greater than 1.
In the above method, step (12-3) may comprise:
Extracting the features of each template in the candidate template set, the features comprising entity diversity, entity accuracy, template frequency and template intensity;
Ranking the templates on each feature, selecting the N templates with the best overall evaluation, and adding them to the template library;
Wherein entity diversity is the number of distinct named entities that the template extracts from the seed-entity sentence set;
Entity accuracy is the ratio of the number of named entities extracted by the template from the seed-entity sentence set that are in the seed library to the total number of named entities the template extracts from the seed-entity sentence set;
Template frequency is the ratio of the number of times the template occurs in the seed-entity sentence set to the total number of templates contained in the seed-entity sentence set;
Template intensity is calculated from the following two quantities (the formula itself is an image that is not reproduced in this text):
The number of categories in which the template has occurred, that is, the number of distinct categories among the named entities the template extracts from the seed-entity sentence set; and the total number of entity categories, that is, the number of distinct categories among the named entities in the seed library.
In the above method, the confidence of a named entity in step (12-5) equals the number of times the named entity co-occurs in the original microblog message set with the template that extracted it, divided by the product of the number of times the named entity occurs in the original microblog message set and the number of times the template occurs in the original microblog message set.
In the above method, step (12-2) may comprise:
For each seed entity occurring in each sentence of the seed-entity sentence set, taking the four characters before it and the four characters after it as its preceding and following context;
Matching the preceding and following context separately against a vocabulary of common Chinese words and taking the longest word matched as the template; when the word adjacent to the seed entity is a single-character word, extending the matching length until another word is matched; and when no common word can be matched, taking the four-character string as the template;
Labeling the category of the template as the category of the seed entity;
Taking all templates obtained as the candidate template set.
In the above method, step 2 may comprise:
Step 2-1): treating every sentence of each microblog message text in the training data set as a labeling sequence, each labeling unit of which is one Chinese character, and using external linguistic-knowledge lexicons to extract linguistic-knowledge features for each labeling unit of the sequence;
Step 2-2): training the named entity recognizer with a conditional random field model on the extracted linguistic-knowledge features.
In step 2-1) of the above method, the external linguistic-knowledge lexicons comprise: lexicon 1, organization-name suffixes; lexicon 2, place names; lexicon 3, place-name suffixes; lexicon 4, surnames; lexicon 5, common single-character given names; lexicon 6, first characters of common two-character given names; lexicon 7, second characters of common two-character given names; lexicon 8, titles and forms of address; lexicon 9, microblog celebrities (users with "V" verification on the microblog platform); lexicon 10, common words; lexicon 11, common single-character words.
In step 2-1) of the above method, the linguistic-knowledge features extracted for each labeling unit comprise:
1) whether the current labeling unit (a single Chinese character) appears in any of lexicons 1 to 11;
2) whether the current labeling unit together with the one preceding character appears in lexicon 1, 2, 3, 8, 9 or 10;
3) whether the current labeling unit together with the two preceding characters appears in lexicon 1, 2, 3, 8, 9 or 10;
4) whether the current labeling unit together with the three preceding characters appears in lexicon 1, 2, 8, 9 or 10;
5) whether the current labeling unit together with the four preceding characters appears in lexicon 1, 2 or 10;
6) whether the current labeling unit appears in lexicon 5 and the preceding character appears in lexicon 4;
7) whether the current labeling unit appears in lexicon 6 and the preceding character appears in lexicon 4;
8) whether the current labeling unit appears in lexicon 7 and the preceding character appears in lexicon 6;
9) whether the current labeling unit appears in lexicon 7 and the two preceding characters appear in lexicon 6 and lexicon 4 respectively;
10) whether the word formed by the current labeling unit and the two preceding characters, or the one preceding character, is contained in lexicon 9.
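A few of the features listed above can be sketched in Python as follows. This is only an illustrative sketch: the function name `lexicon_features`, the feature names, the dictionary layout of the lexicons and the toy lexicon contents are all assumptions made here, and features 2 and 6 are read as "the string formed by the preceding character plus the current unit is a lexicon entry" and "the current unit is a given-name character that follows a surname character".

```python
def lexicon_features(chars, i, lex):
    """Binary lexicon features for the labeling unit chars[i].

    `lex` maps a lexicon number (1..11) to a set of strings.
    Only features 1, 2 and 6 from the list above are shown.
    """
    cur = chars[i]
    prev1 = chars[i - 1] if i >= 1 else ""
    feats = {}
    # Feature 1: the unit itself appears in any lexicon.
    feats["in_any"] = any(cur in words for words in lex.values())
    # Feature 2: preceding character + unit is an entry of lexicon 1, 2, 3, 8, 9 or 10.
    bigram = prev1 + cur
    feats["bigram_in_lex"] = bool(prev1) and any(
        bigram in lex.get(k, set()) for k in (1, 2, 3, 8, 9, 10)
    )
    # Feature 6: unit is in lexicon 5 and the preceding character is in lexicon 4.
    feats["given_after_surname"] = (
        cur in lex.get(5, set()) and prev1 in lex.get(4, set())
    )
    return feats


# Toy lexicons (hypothetical contents): 2 = place names, 4 = surnames,
# 5 = single-character given names.
toy_lex = {2: {"北京"}, 4: {"刘"}, 5: {"欢"}}
units = list("刘欢在北京")
name_feats = lexicon_features(units, 1, toy_lex)   # unit "欢"
place_feats = lexicon_features(units, 4, toy_lex)  # unit "京"
```

The remaining features follow the same pattern with longer preceding windows or other lexicon pairs.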
In another aspect, the invention provides a system for recognizing named entities in microblog messages, comprising:
An automatic labeling device, which, based on a small number of named entities specified as seeds, automatically labels a number of microblog messages from the original microblog message set to be processed to form a training data set;
A training device, which trains a named entity recognizer with the training data set;
A recognition device, which uses the trained named entity recognizer to recognize named entities in microblog messages.
Wherein the training device is further configured to:
Treat every sentence of each microblog message text in the training data set as a labeling sequence, each labeling unit of which is one Chinese character, and use external linguistic-knowledge lexicons to extract linguistic-knowledge features for each labeling unit of the sequence; and
Train the named entity recognizer with a conditional random field model on the extracted linguistic-knowledge features.
Compared with the prior art, the invention has the following advantages:
1. Only a small number of existing seed entities need to be specified for a high-quality training set to be labeled automatically. For rapidly updated text such as microblog messages, this significantly reduces labor cost.
2. High-quality labeled data is produced step by step in an iterative manner: in each iteration, the top N new named entities that best reflect how named entities occur in real microblog data are added to the seed library, so the labeled data finally generated covers the whole microblog data set well.
3. Conventional recognition methods for well-formed documents are mostly built on top of Chinese word segmentation, whereas microblog text is non-standard and contains many abbreviations, ungrammatical expressions, ambiguous words and new words. The recognition method of the invention works on individual characters, which avoids the accumulation of errors that segmentation mistakes would introduce into named entity recognition.
Brief description of the drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the method for recognizing named entities in microblog messages according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the process of automatically labeling microblog message text with seed named entities to produce training data according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the process of training the named entity recognizer according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in more detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Fig. 1 shows a weakly supervised method for recognizing named entities in microblog messages according to an embodiment of the invention. The method comprises: step 1) specifying a small number of named entities as seeds and automatically labeling a number of microblog messages in the original microblog message set (or original microblog message database) to be processed, to serve as the training data set for the named entity recognizer; step 2) training the named entity recognizer on this training data set; and step 3) recognizing named entities in microblog messages with the trained recognizer. The original microblog message set or database stores preprocessed microblog message text. Preprocessing of the collected microblog data may include extracting the body text of each message, filtering out HTML tags and special non-punctuation symbols, and unifying punctuation through full-width/half-width conversion.
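The preprocessing mentioned above might look like the following minimal Python sketch. The source does not say in which direction the width conversion goes or which special symbols are filtered, so this sketch assumes conversion to half-width and only removes HTML tags; the names `to_halfwidth` and `preprocess` are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    space (U+3000) to their half-width counterparts."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(raw: str) -> str:
    """Strip HTML tags, decode HTML entities, unify character width and
    collapse runs of whitespace."""
    text = TAG_RE.sub("", raw)
    text = html.unescape(text)
    text = to_halfwidth(text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess("<a href='x'>ＡＢＣ</a>，１２３！")
```

Characters outside the full-width ASCII range, such as the ideographic full stop, are left unchanged by this sketch.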
In step 1, automatically labeling the microblog data means marking the boundaries of the named entities (also simply called entities) in the microblog message text together with their categories. Fig. 2 is a schematic diagram of the process of automatically labeling microblog message text with seed named entities to produce training data. As shown in Fig. 2, step 1 comprises:
Step 11): specify a small number of named entities as seeds. For example, for each of the three named entity categories (person names, place names and organization names), a small number of named entity instances belonging to that category are taken as seeds. When choosing seeds, named entities that occur frequently in the microblog data may be preferred, so that the following steps can more easily obtain sentences containing them, which improves labeling efficiency. Table 1 lists examples of seed named entities of each category; it is based on one month of data collected from Sina Weibo, from August 1, 2012 to August 31, 2012. For each of the three types (person names, place names and organization names), Table 1 gives 10 named entities of that category as seeds. A named entity used as a seed may also be called a seed entity or a seed named entity.
Table 1
[Table 1 is an image (BDA00003205051600071) that is not reproduced in this text; it lists 10 seed entities for each of the person-name, place-name and organization-name categories.]
Step 12): starting from the initial seed named entities, iteratively generate high-quality templates and high-confidence new named entities of the same categories, and store them in the template library and the seed library respectively. A template is the fixed-length preceding and following context of a named entity. For example, for the seed named entity "Lin Dan" in the sentence "Olympic top seed Chinese player Lin Dan faces Japanese player Sho Sasaki", the template may be "player # faces".
Step 13): combine named entities in the seed library and templates of the same category in the template library pairwise into phrases, extract from the original microblog message database the message texts that contain such a phrase, locate the boundary of the named entity by matching the template, and label the category of the named entity as the category of the seed. This yields labeled microblog message texts. Finally, all labeled message texts obtained are used as the training data set for the named entity recognizer.
More specifically, according to one embodiment of the invention, iteratively generating high-quality templates and high-confidence new named entities of the same categories from the initial seed named entities in step 12), and storing them in the template library and the seed library respectively, comprises the following steps:
Step 1-2-1): set up a template library P and a seed library S; initialize P as empty and initialize S with the person-name, place-name and organization-name instances specified as seeds, for example the seed named entities shown in Table 1.
Step 1-2-2): extract all microblog sentences that contain a seed entity from the seed library, denoted the seed-entity sentence set Ds; Ds is used to generate templates and to assess their quality.
Step 1-2-3): for each seed entity occurring in each sentence of the seed-entity sentence set Ds, take the fixed-length context before and after it (including regular punctuation marks), for example four characters before and four characters after. Match the preceding and following context separately against a vocabulary of common Chinese words (words that occur frequently in ordinary documents) and take the longest word matched as the template. When the word adjacent to the entity is a single-character word (for example a function word such as "with" or "and"), extend the matching length until another word is matched. When no common word can be matched, take the four-character string as the template. Label the category of each template as the category of the seed; the templates obtained form the candidate template set Pc. Taking "Zhao Benshan" in Table 2 as an example, microblog sentences such as "Zhao Benshan stars in the TV series Rural Love Story", "Zhao Benshan takes part in the Jiangsu TV Spring Festival Gala" and "Watched Zhao Benshan's Spring Festival Gala sketch, it felt good" are extracted, from which templates such as "Start#stars in", "Start#takes part in" and "watched #'s Spring Festival Gala" can be obtained, where "Start" denotes the beginning of a sentence.
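The context-matching rule of step 1-2-3 can be sketched in Python as below. This is a simplified illustration, not the patented procedure in full: it omits the extension rule for single-character words, the `End` marker for sentence-final entities is an assumption added here (the source only mentions `Start`), and the Chinese sentence is an illustrative back-translation of the second example above.

```python
def longest_match(context, vocab, from_end):
    """Longest vocabulary word adjacent to the entity.

    For the preceding context the word must end at the entity
    (from_end=True); for the following context it must start there.
    Falls back to the whole context window when nothing matches.
    """
    for size in range(len(context), 0, -1):
        piece = context[-size:] if from_end else context[:size]
        if piece in vocab:
            return piece
    return context


def extract_template(sentence, entity, vocab, window=4):
    """Build a 'left#right' template for the first occurrence of entity."""
    pos = sentence.find(entity)
    if pos < 0:
        return None
    end = pos + len(entity)
    left = sentence[max(0, pos - window):pos]
    right = sentence[end:end + window]
    left_part = longest_match(left, vocab, from_end=True) if left else "Start"
    right_part = longest_match(right, vocab, from_end=False) if right else "End"
    return f"{left_part}#{right_part}"


# "Zhao Benshan takes part in the Jiangsu TV Spring Festival Gala"
common_words = {"参加"}  # "takes part in"
template = extract_template("赵本山参加江苏卫视春晚", "赵本山", common_words)
```

With an empty vocabulary the function returns the raw four-character windows on both sides, which corresponds to the fallback described in the step.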
Step 1-2-4): use all templates in the candidate template set Pc to extract named entities back out of the seed-entity sentence set Ds; the set of all extracted strings is denoted the template-extracted string set St. St contains three classes of strings: seed entities, new entities of the same category, and noise strings. Because new entities of the second class are few, the quality of a template can be analyzed in the following steps from what it extracts into St.
For example, when the templates "Start#stars in", "Start#takes part in" and "watched #'s Spring Festival Gala" are used to extract entities from Ds again, the extracted string set St may contain "Zhao Benshan", which belongs to the first class; it may also contain "Huang Hong", which belongs to the second class; and it may contain the noise string "tomorrow", extracted by the template "Start#takes part in" from a string such as "Tomorrow takes part in ...", which belongs to the third class. It is therefore necessary to analyze which templates extract with better quality.
Step 1-2-5): extract the features of each template in the candidate template set Pc and analyze template quality. The features used for quality analysis comprise entity diversity, entity accuracy, template frequency and template intensity. The templates are ranked on each feature, and the N templates with the best overall evaluation are added to the high-quality template library.
The template features are calculated as follows:
1) entity diversity = number of distinct entities
That is, for a template in the candidate template set, entity diversity is the number of distinct entities the template extracts from the seed-entity sentence set Ds. For example, the template "listen to #'s songs" extracts microblog messages such as "Listening to The Voice of China, the gap compared with listening to Na Ying's songs still feels very, very big", "Listening to Huang Qishan's songs, simply overwhelming" and "Listening to Huang Qishan's songs feels really great, the high notes just go right up". The distinct entities extracted are Na Ying and Huang Qishan, so the entity diversity is 2.
2) Entity accuracy is calculated as follows:
$$\text{entity accuracy} = \frac{\text{number of entities extracted by the template from } D_s \text{ that are in the seed library}}{\text{total number of entities extracted by the template from } D_s}$$
That is, for each template, the total number of entities it extracts from Ds is counted. For example, the template "watched #'s Spring Festival Gala" extracts microblog messages such as "Watched Zhao Benshan's Spring Festival Gala sketch", "Did anyone watch the Jiangsu TV Spring Festival Gala" and "Watched the Lychee Channel Spring Festival Gala, laughed and cried". The template extracts 3 entities in total (Zhao Benshan, Jiangsu TV and Lychee Channel), of which 1 (Zhao Benshan) appears in the seed set (that is, the seed library), so the entity accuracy is 1/3.
3) Template frequency is calculated as follows:
$$\text{template frequency} = \frac{\text{number of occurrences of the template in } D_s}{\text{total number of templates contained in } D_s}$$
4) Template intensity is calculated as follows:
[Formula image BDA00003205051600093 is not reproduced in this text; it is defined in terms of the number of categories in which the template has occurred and the total number of entity categories.]
For a given template, the number of categories in which the template has occurred is the number of distinct types among the named entities the template extracts. For example, with the template "watched #'s Spring Festival Gala" above and the three microblog messages it extracted, Zhao Benshan is a person name while Jiangsu TV and Lychee Channel are organization names, so the number of categories in which the template has occurred is 2. The total number of entity categories is the number of distinct categories among the named entities in the seed library.
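The first three template features can be checked against the worked examples with a short Python sketch. The function names are hypothetical, and the accuracy denominator is assumed here to count distinct extracted entities, which matches the 1/3 example but is not stated explicitly in the source. Template intensity is left out because its formula is not available in the text.

```python
from collections import Counter


def entity_diversity(extracted):
    """Number of distinct entities a template extracts from Ds."""
    return len(set(extracted))


def entity_accuracy(extracted, seed_library):
    """Share of the distinct extracted entities that are already seeds."""
    distinct = set(extracted)
    if not distinct:
        return 0.0
    return len(distinct & set(seed_library)) / len(distinct)


def template_frequency(template, template_occurrences):
    """Occurrences of one template divided by all template occurrences in Ds."""
    if not template_occurrences:
        return 0.0
    return Counter(template_occurrences)[template] / len(template_occurrences)


# Entities from the two worked examples in the text.
gala_entities = ["Zhao Benshan", "Jiangsu TV", "Lychee Channel"]
song_entities = ["Na Ying", "Huang Qishan", "Huang Qishan"]
gala_accuracy = entity_accuracy(gala_entities, {"Zhao Benshan"})
song_diversity = entity_diversity(song_entities)
```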
The above features are combined into an overall evaluation of each template's quality, and the N templates with the best overall evaluation are selected. For example, the values of all features of a template may be normalized and multiplied together, and the product used as the overall evaluation index of the template. Alternatively, a weighted combination of the features may be used as the index, with the weights set according to the actual data environment or system requirements. The above features are only illustrative; other embodiments may use any combination of them.
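As one possible reading of the "normalize and multiply" option, the sketch below scales each feature by its maximum over all candidate templates and ranks templates by the product. The normalization scheme is not specified in the source, so max-scaling is an assumption, as are the function name `rank_templates` and the toy feature values.

```python
def rank_templates(features, top_n):
    """Return the top_n templates by product of max-normalized features.

    `features` maps template -> {feature_name: value}.
    """
    names = sorted({name for feats in features.values() for name in feats})
    maxima = {
        name: max(feats.get(name, 0.0) for feats in features.values())
        for name in names
    }

    def score(template):
        product = 1.0
        for name in names:
            top = maxima[name]
            value = features[template].get(name, 0.0)
            product *= (value / top) if top > 0 else 0.0
        return product

    return sorted(features, key=score, reverse=True)[:top_n]


# Hypothetical feature values for three candidate templates.
candidates = {
    "A": {"diversity": 4, "accuracy": 0.6},
    "B": {"diversity": 2, "accuracy": 1.0},
    "C": {"diversity": 1, "accuracy": 0.25},
}
best_two = rank_templates(candidates, 2)
```

A weighted sum could be substituted for the product in `score` to obtain the second option described above.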
Step 1-2-6): use the high-quality templates in the template library to extract named entities from all unlabeled microblog message data in the original microblog database; the set of all extraction results is denoted the candidate entity set Ec.
For example, with the template "player # faces", direct string matching over the whole original microblog database can match messages such as "Chinese player Ding Junhui faces English player Selby" and "[Snooker World Masters] Chinese player Zhou Yuelong faces White", so the named entities extracted are "Ding Junhui" and "Zhou Yuelong".
Step 1-2-7): compute the confidence of each candidate named entity from the co-occurrence of the candidate named entity and the high-quality templates in the original microblog message database, and add the M entities with the highest confidence to the seed library. This step helps to control noise. The confidence is calculated as follows:
$$\text{confidence} = \frac{\text{number of co-occurrences of the entity and the template}}{\text{number of occurrences of the entity} \times \text{number of occurrences of the template}}$$
That is, the confidence of a candidate named entity equals the number of times the named entity co-occurs in the original microblog message set with the template that extracted it, divided by the product of the number of times the named entity occurs in the original microblog message set and the number of times the template occurs in the original microblog message set.
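The confidence score and the top-M selection can be illustrated with the following Python sketch. It takes precomputed counts as input; the function names and the example counts are hypothetical, and the case of one entity extracted by several templates is not covered.

```python
def confidence(n_together, n_entity, n_template):
    """Co-occurrence count divided by the product of the individual counts."""
    if n_entity == 0 or n_template == 0:
        return 0.0
    return n_together / (n_entity * n_template)


def top_m(candidates, m):
    """Pick the m candidates with the highest confidence.

    `candidates` maps entity -> (n_together, n_entity, n_template).
    """
    return sorted(
        candidates, key=lambda e: confidence(*candidates[e]), reverse=True
    )[:m]


# Hypothetical counts: a real entity versus a frequent noise string.
counts = {
    "Ding Junhui": (3, 5, 10),  # confidence 3 / 50
    "tomorrow": (1, 50, 10),    # confidence 1 / 500
}
selected = top_m(counts, 1)
```

Dividing by the entity's own frequency is what penalizes common noise strings such as "tomorrow", which occur often but rarely next to the template.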
Step 1-2-8): repeat steps 1-2-2) to 1-2-7) until the seed library no longer grows or a preset number of iterations is reached. In the above steps, M and N are integers greater than 1 and can be set according to the needs of the user or the system.
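The overall iteration can be summarized by the control-flow skeleton below. It is a sketch only: `propose_templates` and `extract_candidates` are hypothetical callables standing in for steps 1-2-2 to 1-2-5 and steps 1-2-6 to 1-2-7 respectively, each assumed to return its results best first, and skipping entities that are already seeds is an assumption made here.

```python
def bootstrap(seeds, propose_templates, extract_candidates, n, m, max_iter):
    """Grow the seed and template libraries until no new seed is found
    or max_iter iterations have run."""
    seed_lib = set(seeds)
    template_lib = set()
    for _ in range(max_iter):
        ranked_templates = propose_templates(seed_lib)      # best first
        template_lib.update(ranked_templates[:n])
        ranked_entities = extract_candidates(template_lib)  # highest confidence first
        new_seeds = [e for e in ranked_entities if e not in seed_lib][:m]
        if not new_seeds:
            break  # the seed library no longer grows
        seed_lib.update(new_seeds)
    return seed_lib, template_lib


# Stub callables, for illustration only.
seeds_out, templates_out = bootstrap(
    seeds={"a"},
    propose_templates=lambda seed_lib: ["t%d" % len(seed_lib)],
    extract_candidates=lambda template_lib: ["a", "b", "c"],
    n=1,
    m=1,
    max_iter=5,
)
```

With these stubs the loop adds one seed per iteration and stops on the third pass, when the extractor returns nothing new.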
Step 13): for the expanded seed library and the template library, combine every pair <seed, template> into a phrase, extract from the original microblog message database the message texts that contain the phrase, locate the boundary of the named entity by matching the template, and label the category of the named entity as the category of the seed, finally generating training data that matches the characteristics of microblog data. This training data consists of microblog message texts that contain high-quality templates and high-quality seed entities, and it serves as the labeled corpus for training the named entity recognizer.
For example, the person-name entity "Liu Huan" in the seed library and the person-name template "listen to #'s songs" in the template library are combined into the phrase "listen to Liu Huan's songs". From the original microblog database the message "... after growing up, listened to Liu Huan's songs and watched the TV series Water Margin ..." can be extracted. The template "listen to #'s songs" locates the boundary of the person-name entity "Liu Huan" in this message, and the entity "Liu Huan" is labeled with the category person name.
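The labeling of a single message can be sketched in Python as follows. The character-level B/I/O tag scheme and the category code `PER` are assumptions made for illustration, since the source does not name a tag scheme; templates containing the `Start` marker are not handled, and the Chinese strings are illustrative back-translations of the example above.

```python
def annotate(message, seed, template, category):
    """Label one message by matching the phrase formed from template and seed.

    Returns a list of (character, tag) pairs, or None if the phrase
    does not occur in the message.
    """
    left, right = template.split("#")
    phrase = left + seed + right
    pos = message.find(phrase)
    if pos < 0:
        return None
    start = pos + len(left)
    end = start + len(seed)
    tags = ["O"] * len(message)
    tags[start] = "B-" + category
    for i in range(start + 1, end):
        tags[i] = "I-" + category
    return list(zip(message, tags))


# "after growing up, listened to Liu Huan's songs"
labeled = annotate("长大后听刘欢的歌", "刘欢", "听#的歌", "PER")
```

Each labeled message produced this way becomes one training sequence for the recognizer.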
Step 2, use the Twitter message data of the automatic mark of step 1 as training dataset, the named entity recognition device that utilizes this training dataset to train to be applicable to Twitter message just can utilize the named entity recognition device that trains to identify named entity in the Twitter message then.Wherein, the named entity recognition device can wait to train based on Hidden Markov Model (HMM), maximum entropy model, conditional random field models.The feature that is used for training named entity recognition device can be determined according to real data environment and system requirements.But the feature that adopts can influence accuracy of identification and the efficient of the name recognizer of training.
In a preferred embodiment of the invention, to further improve the accuracy and efficiency of the named entity recognizer, step 2 uses the microblog message data automatically labeled in step 1 as the training set, extracts multiple linguistic-knowledge features from the training data, and trains a conditional random field (CRF) model, thereby obtaining a named entity recognizer suited to microblog messages. The linguistic-knowledge features of the training data are extracted from several kinds of external linguistic knowledge according to designed feature-extraction combinations.
Fig. 3 is a schematic diagram of the process of training the named entity recognizer according to an embodiment of the invention. The external linguistic knowledge may take the form of external linguistic-knowledge word lists. Step 2 mainly comprises:
Step 2-1): treat each sentence in each microblog message text of the training set as one labeling sequence, in which each labeling unit is one Chinese character, and use the external linguistic-knowledge word lists to extract features for each labeling unit of the sequence. A digit string or a foreign-language character string is treated as a single labeling unit as a whole. A sentence is a character string delimited by a comma, full stop, exclamation mark, or semicolon; when several such symbols occur consecutively, they are treated as a single sentence separator.
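The segmentation rules of step 2-1) can be sketched with two regular expressions. This is an illustration under assumptions: both full-width and half-width forms of the four separators are accepted, "foreign-language character string" is approximated by a run of ASCII letters, and whitespace is skipped. None of these choices is specified in the text.

```python
import re
from typing import List

# Comma, full stop, exclamation mark, semicolon (full- and half-width forms
# are an assumption); a run of separators counts as one sentence boundary.
_SEPARATORS = re.compile(r"[,，.。!！;；]+")

# A digit run or an ASCII letter run is one labeling unit; any other
# non-space character (for example a Chinese character) is its own unit.
_UNIT = re.compile(r"[0-9]+|[A-Za-z]+|\S")


def split_sentences(text: str) -> List[str]:
    """Split a microblog message into sentences at the separator symbols."""
    return [s for s in _SEPARATORS.split(text) if s.strip()]


def labeling_units(sentence: str) -> List[str]:
    """Split one sentence into labeling units."""
    return _UNIT.findall(sentence)
```

Note that treating the half-width full stop as a separator would also split decimal numbers and URLs; a production tokenizer would need extra handling for those cases.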
The external linguistic-knowledge word lists used may include: word list 1, organization suffix list; word list 2, general place name list; word list 3, place name suffix list; word list 4, surname list; word list 5, list of common single-character given names; word list 6, list of first characters of common two-character given names; word list 7, list of second characters of common two-character given names; word list 8, list of titles and forms of address; word list 9, microblog celebrity list (users with "V" verification on the microblog platform); word list 10, common word list; word list 11, list of common single-character words.
For each labeling unit, the following external linguistic-knowledge features are extracted:
1) Whether the current labeling unit (each Chinese character) appears in any of word lists 1 to 11. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
2) Whether the string formed by the current labeling unit and its one preceding character appears in word list 1, 2, 3, 8, 9, or 10. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
3) Whether the string formed by the current labeling unit and its two preceding characters appears in word list 1, 2, 3, 8, 9, or 10. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
4) Whether the string formed by the current labeling unit and its three preceding characters appears in word list 1, 2, 8, 9, or 10. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
5) Whether the string formed by the current labeling unit and its four preceding characters appears in word list 1, 2, or 10. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
6) Whether the current labeling unit appears in word list 5 and its one preceding character appears in word list 4. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
7) Whether the current labeling unit appears in word list 6 and its one preceding character appears in word list 4. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
8) Whether the current labeling unit appears in word list 7 and its one preceding character appears in word list 6. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
9) Whether the current labeling unit appears in word list 7 and its two preceding characters appear in word list 6 and word list 4 respectively. If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
10) Whether the word formed by the current labeling unit and its two preceding characters, or by the current labeling unit and its one preceding character, is contained in word list 9 (longest match is used). If so, the feature takes the value "true" or "1"; otherwise it takes "false" or "0".
In this way, a feature vector of 10 elements is obtained for each labeling unit. Take the microblog message text "according to statistics from the Haidian tax bureau ..." as an example. For the character "bureau" (the last character of "tax bureau"), the word "tax bureau", formed by the current labeling unit and its two preceding characters, appears in word list 1, so feature 3) of the labeling unit "bureau" is set to true. The character "bureau" also appears in one of word lists 1 to 11, so feature 1) of the labeling unit "bureau" is set to true. Proceeding in this way, the feature vector of the labeling unit "bureau" is obtained as (1, 1, 1, 0, 0, 0, 0, 0, 0, 0). Likewise, for each microblog message text, a feature matrix is obtained that consists of the feature vectors of all the characters it contains.
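The ten features can be computed with a few set lookups, as in the sketch below. It assumes the word lists are held in a dictionary from list number (1 to 11) to a set of entries, and it reads feature 9) as "immediately preceding character in word list 6, the character before that in word list 4"; both the data layout and that reading are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical container: word-list number (1..11) -> set of entries.
Lexicons = Dict[int, Set[str]]


def unit_features(units: List[str], i: int, lex: Lexicons) -> List[int]:
    """Return the ten 0/1 features for the labeling unit at position i."""

    def has(word: Optional[str], ids: Iterable[int]) -> bool:
        # True if `word` exists and is an entry of any of the listed word lists.
        return word is not None and any(word in lex.get(n, set()) for n in ids)

    def back(n: int) -> Optional[str]:
        # Current unit joined with its n preceding units; None near the start.
        return "".join(units[i - n:i + 1]) if i >= n else None

    cur = units[i]
    prev1 = units[i - 1] if i >= 1 else None
    prev2 = units[i - 2] if i >= 2 else None

    features = [
        has(cur, range(1, 12)),                                    # 1)
        has(back(1), (1, 2, 3, 8, 9, 10)),                         # 2)
        has(back(2), (1, 2, 3, 8, 9, 10)),                         # 3)
        has(back(3), (1, 2, 8, 9, 10)),                            # 4)
        has(back(4), (1, 2, 10)),                                  # 5)
        has(cur, (5,)) and has(prev1, (4,)),                       # 6)
        has(cur, (6,)) and has(prev1, (4,)),                       # 7)
        has(cur, (7,)) and has(prev1, (6,)),                       # 8)
        has(cur, (7,)) and has(prev1, (6,)) and has(prev2, (4,)),  # 9)
        has(back(2), (9,)) or has(back(1), (9,)),                  # 10) longest first
    ]
    return [int(f) for f in features]
```

Calling `unit_features` for every position of a sentence and stacking the results gives the feature matrix described above.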
Step 2-2): using the extracted external linguistic-knowledge features, train the named entity recognizer with a conditional random field model.
Training a CRF-based named entity recognizer mainly proceeds as follows. First, in each labeled microblog message text, every character is given a label from four label classes B, I, E, and O: B marks the first character of an entity, I a character inside an entity, E the last character of an entity, and O a non-entity character. For example, the sentence 看赵本山的春晚小品 ("watch Zhao Benshan's Spring Festival Gala sketch") is labeled "看/O 赵/Bp 本/Ip 山/Ep 的/O 春/O 晚/O 小/O 品/O", where Bp, Ip, and Ep denote the beginning, middle, and end of a person-name entity. The labels thus give the position of each entity in the sentence, and training the named entity recognition model amounts to estimating parameters from microblog messages labeled in this way. Next, after features are extracted from the labeled data by the external linguistic-knowledge feature extraction method described above, each labeled microblog message is represented as a combination of features, and the feature sets of all labeled microblog messages serve as the input to the CRF model. Training estimates the weight of each feature function by maximum likelihood. Once the weights of all feature functions have been obtained, the trained named entity recognizer is obtained.
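The B/I/E/O labeling can be illustrated by converting entity spans into a tag sequence. The span format `(start, end, suffix)` with an exclusive end, and the choice to tag a single-unit entity with B only, are assumptions of this sketch; the text does not say how one-character entities are labeled.

```python
from typing import List, Tuple


def bieo_tags(units: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end, suffix) entity spans into B/I/E/O tags.

    `end` is exclusive; `suffix` is a category marker such as "p" for person
    names, giving tags like "Bp", "Ip", "Ep".
    """
    tags = ["O"] * len(units)
    for start, end, suffix in entities:
        tags[start] = "B" + suffix
        for k in range(start + 1, end - 1):
            tags[k] = "I" + suffix
        if end - start > 1:  # assumption: a one-unit entity keeps only the B tag
            tags[end - 1] = "E" + suffix
    return tags
```

The resulting tag sequence, paired with the per-unit feature vectors, is the kind of input a CRF toolkit would be trained on.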
Then, for a microblog message to be recognized, its external linguistic-knowledge features are extracted with the external linguistic-knowledge word lists as described above. The trained named entity recognizer then uses the Viterbi algorithm to compute the label sequence with the highest probability, thereby identifying the named entities in the microblog message.
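A generic Viterbi decoder over additive (log-domain) scores is sketched below to show how the highest-scoring label sequence is found. The score tables `emissions` and `transitions` are hypothetical inputs; in a trained CRF they would be derived from the learned feature weights, which this sketch does not model.

```python
from typing import Dict, List


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the label path with the highest total score.

    emissions[t][y] scores label y at position t; transitions[p][y] scores
    moving from label p to label y. Missing entries count as -infinity.
    """
    if not emissions:
        return []
    neg_inf = float("-inf")
    best = [{y: emissions[0].get(y, neg_inf) for y in labels}]
    backpointers: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Best previous label for reaching y at position t.
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get(p, {}).get(y, neg_inf),
            )
            scores[y] = (
                best[-1][prev]
                + transitions.get(prev, {}).get(y, neg_inf)
                + emissions[t].get(y, neg_inf)
            )
            pointers[y] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Leaving a transition out of the table (for example B directly to O) makes that move impossible, which is one way to keep decoded sequences consistent with the B/I/E/O scheme.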
In yet another embodiment of the invention, a system for recognizing named entities in microblog messages is also provided. The system comprises an automatic labeling device, a training device, and a recognition device. The automatic labeling device uses the method described above to label, on the basis of a small number of named entities designated as seeds, a certain number of microblogs from the original microblog message set to be processed as the training dataset. The training device uses the method described above to train the named entity recognizer on the training dataset. The recognition device uses the trained named entity recognizer to identify the named entities in microblog messages.
To verify the performance of the method, the inventors also carried out experiments. Taking Sina Weibo as an example, the data from August 1, 2012 to August 31, 2012 (about 50 million messages) were selected from all collected microblog messages as the target dataset of the experiment. The weakly supervised method of the invention was used to perform named entity recognition on this dataset. The named entity categories recognized are person names, place names, and organization names. To evaluate the recognition performance of the invention, a randomly selected portion of the microblog messages was labeled manually and used as the test dataset. Table 2 lists the details of the test dataset.
Table 2
Total microblogs    Person names    Place names    Organization names
15379               1654            1556           1055
Combining the method described above with the given experimental target dataset, the specific experiment and test procedure is as follows:
1) Preprocess the collected microblog messages: extract the body text, filter out special characters, HTML tags, and other characters that are not regular Chinese punctuation, and store the result in the original microblog database;
2) For each of the three named entity categories (person names, place names, and organization names), give 10 examples belonging to that category as seeds, such as the seed entities of each category listed in Table 1 above;
3) Use the automatic labeling steps described above to select a portion of the microblog data from the original microblog database automatically and label it automatically;
4) Use the 11 word lists described above, including the organization suffix list, general place name list, place name suffix list, surname list, list of common single-character given names, lists of first and second characters of common two-character given names, list of titles and forms of address, common word list, and list of common single-character words, to further extract the external linguistic-knowledge features from the labeled microblog data;
5) Use the automatically labeled microblog data as the training set and, together with the extracted features, train the CRF model to obtain a named entity recognizer suited to microblogs.
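The preprocessing in step 1) of this procedure could be approximated as follows. The text does not specify which characters are kept, so the whitelist here (CJK ideographs, ASCII letters and digits, whitespace, and a small set of punctuation marks) and the function name `clean_microblog` are assumptions made purely for illustration.

```python
import re

# Anything that looks like an HTML tag.
_TAG = re.compile(r"<[^>]+>")

# Assumed whitelist: CJK ideographs, ASCII letters and digits, whitespace,
# and common Chinese and ASCII punctuation. Everything else is dropped.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s，。！？；：、“”《》（）,.!?;:]")


def clean_microblog(raw: str) -> str:
    """Strip HTML tags and special characters, then normalize whitespace."""
    text = _TAG.sub("", raw)
    text = _DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Removing tags before filtering characters matters here, because the filter would otherwise delete the angle brackets and leave the tag names behind in the text.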
The recognition performance of the invention is evaluated by precision, recall, and the F1 value, which takes both precision and recall into account. Table 3 lists the recognition results of the named entity recognition method according to the embodiment of the invention for person names, place names, and organization names.
(1) Precision equals the number of correctly identified named entities divided by the number of all named entities identified.
(2) Recall equals the number of correctly identified named entities divided by the total number of named entities contained in the microblog messages.
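The F1 value is computed from precision and recall as:
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$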
Table 3
Category              Precision    Recall    F1 value
Person names          85.8%        81.3%     83.5%
Place names           87.2%        84.7%     85.9%
Organization names    80.1%        78.4%     79.2%
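The three measures can be computed from raw counts as in the sketch below. It assumes the conventional definition of F1 as the harmonic mean of precision and recall, which reproduces the F1 column of Table 3 from its precision and recall columns; the function names are illustrative.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def precision_recall_f1(correct: int, identified: int, total: int) -> Tuple[float, float, float]:
    """Compute (precision, recall, F1) from entity counts.

    correct:    number of correctly identified named entities
    identified: number of all named entities identified
    total:      number of named entities contained in the test messages
    """
    precision = correct / identified if identified else 0.0
    recall = correct / total if total else 0.0
    return precision, recall, f1_score(precision, recall)
```

For example, `round(f1_score(0.858, 0.813), 3)` gives 0.835, matching the person-name row of Table 3.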
As the specific embodiments above show, compared with traditional supervised named entity recognition methods, the microblog named entity recognition method provided by the invention saves the cost of manually labeling data: only a small number of existing seed entities need to be specified for a high-quality training set to be labeled automatically. For text that is updated as quickly as microblog messages, this significantly reduces labor cost. Moreover, the invention generates high-quality labeled data step by step in an iterative manner, each time selecting the top M new named entities that best reflect how named entities occur in real microblog data and adding them to the seed bank, so that the labeled data finally generated covers the whole microblog dataset well. In addition, traditional recognition methods mostly build on Chinese word segmentation, whereas microblog language is non-standard and contains many abbreviations, ungrammatical expressions, ambiguous words, and new words. The method of the invention works at the character level when training the named entity recognizer, which avoids the accumulation of errors that word segmentation mistakes would cause in named entity recognition.
Although the invention has been described through preferred embodiments, it is not limited to the embodiments described herein, and it also covers various changes and variations made without departing from the scope of the invention.

Claims (11)

1. A method for recognizing named entities in microblog messages, the method comprising:
Step 1: specifying a small number of named entities as seeds, and automatically labeling a certain number of microblogs from the original microblog message set to be processed as a training dataset;
Step 2: training a named entity recognizer with the training dataset;
Step 3: using the trained named entity recognizer to identify the named entities in microblog messages.
2. The method according to claim 1, wherein step 1 comprises:
Step 11): for each of the three named entity categories (person names, place names, and organization names), taking a small number of named entity examples belonging to that category as initial seed entities;
Step 12): adding the initial seed entities to a seed bank, and iteratively generating templates and new named entities of the same category, which are stored in a template base and the seed bank respectively; wherein a template is the fixed-length preceding and following context of a named entity;
Step 13): combining the named entities in the seed bank pairwise with the templates in the template base into phrases, extracting from the original microblog message set the microblog message texts that contain such a phrase, locating the boundary of the named entity with the template, and labeling the named entity with the category of the seed, thereby obtaining labeled microblog message texts; and taking all labeled microblog message texts so obtained as the training dataset.
3. The method according to claim 2, wherein iteratively generating templates and new named entities of the same category in step 12) and storing them in the template base and the seed bank respectively comprises:
Step (12-1): extracting all microblog sentences that contain seed entities in the seed bank, as a seed-entity sentence set;
Step (12-2): for each seed entity occurring in each sentence of the seed-entity sentence set, taking its fixed-length preceding and following context as a template, and labeling the template with the category of that seed entity; and taking all templates so obtained as a candidate template set;
Step (12-3): selecting the top N best-quality templates from the candidate template set and adding them to the template base;
Step (12-4): using the templates in the template base to extract named entities from all unlabeled microblog data in the original microblog message set;
Step (12-5): from all extracted named entities, selecting the top M named entities with the highest confidence and adding them to the seed bank;
Step (12-6): repeating steps (12-1) to (12-5) until the seed bank no longer grows or a preset number of iterations is reached, wherein M and N are integers greater than 1.
4. The method according to claim 3, wherein step (12-3) comprises:
extracting the features of each template in the candidate template set, the features of a template comprising entity diversity, entity accuracy, template frequency, and template strength;
ranking the templates on each feature, selecting the top N templates with the best overall evaluation, and adding them to the template base;
wherein entity diversity is characterized by the number of distinct named entities that the template extracts from the seed-entity sentence set;
entity accuracy equals the ratio of the number of named entities in the seed bank that the template extracts from the seed-entity sentence set to the total number of named entities that the template extracts from the seed-entity sentence set;
template frequency equals the ratio of the number of times the template occurs in the seed-entity sentence set to the total number of templates contained in the seed-entity sentence set;
template strength is calculated as follows:
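[Formula image not reproduced in this text: template strength, expressed in terms of the number of categories in which the template has occurred and the total number of entity categories.]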
wherein the number of categories in which the template has occurred is the number of distinct categories of the named entities that the template extracts from the seed-entity sentence set, and the total number of entity categories is the number of distinct categories of the named entities contained in the seed bank.
5. The method according to claim 3, wherein, in step (12-5), the confidence of a named entity equals the number of times the named entity and the template that extracted it occur together in the original microblog message set, divided by the product of the number of times the named entity occurs in the original microblog message set and the number of times the template occurs in the original microblog message set.
6. The method according to claim 3, wherein step (12-2) comprises:
for each seed entity occurring in each sentence of the seed-entity sentence set, taking the four characters before it and the four characters after it as its preceding and following context;
matching the preceding and following context respectively against a list of common Chinese words, and taking the longest matched word as the template; when the word adjacent to the seed entity is a single-character word, extending the matching length until another word is matched; and when no common word can be matched, taking the four-character string as the template;
labeling the template with the category of the seed entity;
taking all templates so obtained as the candidate template set.
7. The method according to any one of claims 1 to 6, wherein step 2 comprises:
Step 2-1): treating each sentence in each microblog message text of the training dataset as one labeling sequence, in which each labeling unit is one Chinese character, and using external linguistic-knowledge word lists to extract external linguistic-knowledge features for each labeling unit of the labeling sequence;
Step 2-2): using the extracted external linguistic-knowledge features, training the named entity recognizer with a conditional random field model.
8. The method according to claim 7, wherein, in step 2-1), the external linguistic-knowledge word lists comprise: word list 1, organization suffix list; word list 2, general place name list; word list 3, place name suffix list; word list 4, surname list; word list 5, list of common single-character given names; word list 6, list of first characters of common two-character given names; word list 7, list of second characters of common two-character given names; word list 8, list of titles and forms of address; word list 9, microblog celebrity list; word list 10, common word list; word list 11, list of common single-character words.
9. The method according to claim 8, wherein, in step 2-1), the external linguistic-knowledge features extracted for each labeling unit comprise:
a) whether the current labeling unit appears in any of word lists 1 to 11;
b) whether the string formed by the current labeling unit and its one preceding character appears in word list 1, 2, 3, 8, 9, or 10;
c) whether the string formed by the current labeling unit and its two preceding characters appears in word list 1, 2, 3, 8, 9, or 10;
d) whether the string formed by the current labeling unit and its three preceding characters appears in word list 1, 2, 8, 9, or 10;
e) whether the string formed by the current labeling unit and its four preceding characters appears in word list 1, 2, or 10;
f) whether the current labeling unit appears in word list 5 and its one preceding character appears in word list 4;
g) whether the current labeling unit appears in word list 6 and its one preceding character appears in word list 4;
h) whether the current labeling unit appears in word list 7 and its one preceding character appears in word list 6;
i) whether the current labeling unit appears in word list 7 and its two preceding characters appear in word list 6 and word list 4 respectively;
j) whether the word formed by the current labeling unit and its two preceding characters, or by the current labeling unit and its one preceding character, is contained in word list 9.
10. A system for recognizing named entities in microblog messages, the system comprising:
an automatic labeling device, configured to automatically label, on the basis of a small number of named entities designated as seeds, a certain number of microblogs from the original microblog message set to be processed as a training dataset;
a training device, configured to train a named entity recognizer with the training dataset;
a recognition device, configured to use the trained named entity recognizer to identify the named entities in microblog messages.
11. The system according to claim 10, wherein the training device is further configured to:
treat each sentence in each microblog message text of the training dataset as one labeling sequence, in which each labeling unit is one Chinese character, and use external linguistic-knowledge word lists to extract external linguistic-knowledge features for each labeling unit of the labeling sequence; and
use the extracted external linguistic-knowledge features to train the named entity recognizer with a conditional random field model.
CN201310182978.XA 2013-05-17 2013-05-17 Named entity recognition method and system in Twitter message Active CN103268339B (en)

Publications (2)

Publication Number Publication Date
CN103268339A true CN103268339A (en) 2013-08-28
CN103268339B CN103268339B (en) 2016-06-01

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853710A (en) * 2013-11-21 2014-06-11 北京理工大学 Coordinated training-based dual-language named entity identification method
CN106104522A (en) * 2014-03-18 2016-11-09 微软技术许可有限责任公司 The entity platform of name and storage
CN106104522B (en) * 2014-03-18 2019-07-16 微软技术许可有限责任公司 For reinforcing the method, system and computer memory device of any user content
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN104636466A (en) * 2015-02-11 2015-05-20 中国科学院计算技术研究所 Entity attribute extraction method and system oriented to open web page
CN104679885A (en) * 2015-03-17 2015-06-03 北京理工大学 User search string organization name recognition method based on semantic feature model
CN105224520A (en) * 2015-09-28 2016-01-06 北京信息科技大学 A kind of Chinese patent documentation term automatic identifying method
CN106407183A (en) * 2016-09-28 2017-02-15 医渡云(北京)技术有限公司 Method and device for generating medical named entity recognition system
CN106407183B (en) * 2016-09-28 2019-06-28 医渡云(北京)技术有限公司 Medical treatment name entity recognition system generation method and device
CN106503192B (en) * 2016-10-31 2019-10-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN107168946A (en) * 2017-04-14 2017-09-15 北京化工大学 A kind of name entity recognition method of medical text data
CN107329951A (en) * 2017-06-14 2017-11-07 深圳市牛鼎丰科技有限公司 Build name entity mark resources bank method, device, storage medium and computer equipment
CN107291700A (en) * 2017-07-17 2017-10-24 广州特道信息科技有限公司 Entity word recognition method and device
CN110020120B (en) * 2017-10-10 2023-11-10 腾讯科技(北京)有限公司 Feature word processing method, device and storage medium in content delivery system
CN110020120A (en) * 2017-10-10 2019-07-16 腾讯科技(北京)有限公司 Feature word treatment method, device and storage medium in content delivery system
CN109753976B (en) * 2017-11-01 2021-03-19 中国电信股份有限公司 Corpus labeling device and method
CN109753976A (en) * 2017-11-01 2019-05-14 中国电信股份有限公司 Corpus labeling device and method
CN108304375A (en) * 2017-11-13 2018-07-20 广州腾讯科技有限公司 A kind of information identifying method and its equipment, storage medium, terminal
CN108304375B (en) * 2017-11-13 2022-01-07 广州腾讯科技有限公司 Information identification method and equipment, storage medium and terminal thereof
CN108170708B (en) * 2017-11-23 2021-03-30 杭州大搜车汽车服务有限公司 Vehicle entity identification method, electronic equipment, storage medium and system
CN108170708A (en) * 2017-11-23 2018-06-15 杭州大搜车汽车服务有限公司 A kind of vehicle entity recognition method, electronic equipment, storage medium, system
CN109977391A (en) * 2017-12-28 2019-07-05 ***通信集团公司 A kind of information extraction method and device of text data
CN109977391B (en) * 2017-12-28 2020-12-08 ***通信集团公司 Information extraction method and device for text data
CN108255816A (en) * 2018-03-12 2018-07-06 北京神州泰岳软件股份有限公司 A kind of name entity recognition method, apparatus and system
CN108763218A (en) * 2018-06-04 2018-11-06 四川长虹电器股份有限公司 A kind of video display retrieval entity recognition method based on CRF
CN108776656A (en) * 2018-06-05 2018-11-09 南京农业大学 Food safety affair entity abstracting method based on condition random field
CN108959256B (en) * 2018-06-29 2023-04-07 北京百度网讯科技有限公司 Short text generation method and device, storage medium and terminal equipment
CN108959256A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Short text generation method and device, storage medium and terminal equipment
CN111259669B (en) * 2018-11-30 2023-06-27 阿里巴巴集团控股有限公司 Information labeling method, information processing method and information processing device
CN111259669A (en) * 2018-11-30 2020-06-09 阿里巴巴集团控股有限公司 Information labeling method, information processing method and device
CN109740159A (en) * 2018-12-29 2019-05-10 北京泰迪熊移动科技有限公司 Processing method and device for named entity recognition
CN109740159B (en) * 2018-12-29 2022-04-26 北京泰迪熊移动科技有限公司 Processing method and device for named entity recognition
CN110059163B (en) * 2019-04-29 2022-05-13 百度在线网络技术(北京)有限公司 Method and device for generating template, electronic equipment and computer readable medium
CN110059163A (en) * 2019-04-29 2019-07-26 百度在线网络技术(北京)有限公司 Method and device for generating template, electronic equipment and computer readable medium
WO2021000491A1 (en) * 2019-07-03 2021-01-07 平安科技(深圳)有限公司 Question entity recognition and linking method and apparatus, computer device and storage medium
CN110457436A (en) * 2019-07-30 2019-11-15 腾讯科技(深圳)有限公司 Information labeling method, apparatus, computer readable storage medium and electronic equipment
CN110457436B (en) * 2019-07-30 2022-12-27 腾讯科技(深圳)有限公司 Information labeling method and device, computer readable storage medium and electronic equipment
CN110795941A (en) * 2019-10-26 2020-02-14 创新工场(广州)人工智能研究有限公司 Named entity identification method and system based on external knowledge and electronic equipment
CN110795941B (en) * 2019-10-26 2024-04-05 创新工场(广州)人工智能研究有限公司 Named entity identification method and system based on external knowledge and electronic equipment
CN111079435A (en) * 2019-12-09 2020-04-28 深圳追一科技有限公司 Named entity disambiguation method, device, equipment and storage medium
CN111274821A (en) * 2020-02-25 2020-06-12 北京明略软件***有限公司 Named entity identification data labeling quality evaluation method and device
CN111274821B (en) * 2020-02-25 2024-04-26 北京明略软件***有限公司 Named entity identification data labeling quality assessment method and device

Also Published As

Publication number Publication date
CN103268339B (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN103268339B (en) Recognition method and system of named entities in microblog messages
CN109977413B (en) Emotion analysis method based on improved CNN-LDA
CN106503192B (en) Name entity recognition method and device based on artificial intelligence
CN111767741B (en) Text emotion analysis method based on deep learning and TFIDF algorithm
CN108628833B (en) Method and device for determining summary of original content and method and device for recommending original content
CN106599032B (en) Text event extraction method combining sparse coding and structured perceptron
CN106294593B (en) Relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning
CN101599071B (en) Automatic extraction method of conversation text topic
CN102831177B (en) Sentence error correction method and system thereof
CN110188351A (en) Training method and device for sentence fluency and syntactic scoring model
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
CN110008335A (en) Method and device for natural language processing
CN107799116A (en) Multi-turn interactive parallel semantic understanding method and apparatus
CN103869998B (en) Method and device for ranking candidate items generated by an input method
CN109949799B (en) Semantic parsing method and system
CN110134799B (en) BM25 algorithm-based text corpus construction and optimization method
CN102253976B (en) Metadata processing method and system for spoken language learning
CN109508441B (en) Method and device for realizing data statistical analysis through natural language and electronic equipment
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN100511214C (en) Method and system for abstracting batch single document for document set
CN109299272B (en) Large-information-quantity text representation method for neural network input
CN101645083A (en) Acquisition system and method of text field based on concept symbols
CN109299277A (en) Public opinion analysis method, server and computer readable storage medium
Lee et al. Personalizing recurrent-neural-network-based language model by social network
Hillard et al. Learning weighted entity lists from web click logs for spoken language understanding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130828

Assignee: Branch DNT data Polytron Technologies Inc

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000033

Denomination of invention: Recognition method and system of named entities in microblog messages

Granted publication date: 20160601

License type: Common License

Record date: 20180807
