CN102103594A - Character data recognition and processing method and device - Google Patents

Publication number
CN102103594A
Authority
CN
China
Prior art keywords
character data
feature
benchmark
frequency
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009102429754A
Other languages
Chinese (zh)
Inventor
赵立红
万小军
吴於茜
杨建武
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Application filed by BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd, Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN2009102429754A
Publication of CN102103594A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for recognizing and processing character data. The method comprises the following steps: recognizing feature character data according to reference linguistic data and a reference template to obtain the different entity names corresponding to each named entity; obtaining the feature affix frequency of each entity name; recognizing the character data to be processed according to the feature affix frequency, the reference template and predefined linguistic data to obtain the different entity names corresponding to each named entity; and performing subsequent analysis and processing with the entity names recognized from the character data to be processed as data parameters. Because feature affixes are added as a recognition feature column, the method and the device reduce the relatively large recognition errors for predefined character data that affect later retrieval and translation, improve the recognition accuracy of named entities, and avoid named entities going unrecognized or being misrecognized because they are expressed freely or not in a sufficiently standard way.

Description

Method and device for character data recognition and processing
Technical field
The present invention relates to the technical field of computerized data retrieval, and in particular to a method and a device for character data recognition and processing.
Background technology
The Internet has developed rapidly since its birth in the early 1990s, and information is published on it mainly in the form of web pages. According to the latest estimate, the number of web pages on the Internet has exceeded 550 billion (1 billion equals 1,000,000,000), and the Internet, as the largest information repository in the world, covers every field of the real world. Faced with this massive information source, people urgently need automatic tools that help them find the truly important information quickly, and research on information extraction has emerged in response. The main purpose of information extraction is to convert unstructured text into structured or semi-structured information and to store it in a specific form for user queries or further analysis and use. Named entity recognition, as one of its basic steps, has gradually become a key technology of natural language processing.
A named entity (Named Entity) is a concrete or abstract entity in the real world. Named entities mainly include entity names, time expressions, numeric expressions, etc. In a concrete application, the exact meaning of a named entity depends on the circumstances; for example, addresses, URLs, e-mail addresses, telephone numbers, ship numbers, conference titles, etc. may need to be treated as named entities. Some words are entity names in a specialized field, for example drug names, ship names, reference lists, etc., and should also be taken into consideration.
In general, the task of named entity recognition (Named Entity Recognition, NER) is defined as identifying and classifying the proper names and meaningful quantity phrases that occur in a text. In practical named entity recognition, recognizing time expressions, numeric expressions, etc. is relatively simple, and designing rules and statistically training on data for them is also relatively easy. Person names, place names and organization names, by contrast, are open-ended, extensible and arbitrary in their composition, so their recognition is more prone to false positives and omissions, and technical innovation in recognizing these three classes of named entities is more challenging. Specifically, person names include domestic names, transliterated foreign names, etc.; place names include cities, countries, streets, provinces, counties and townships, rivers, mountains, etc.; organization names include company names, names of government bodies at all levels, committees, etc.
Text-based named entity recognition has gone through two main stages of development: rule-based methods and statistical methods. The early methods were essentially all rule-based; they are fairly traditional and are described in the literature, for example the methods proposed in "Description of the LaSIE-II system as used for MUC-7" (author: Humphreys K) and "Named entity recognition without gazetteers" (author: A. Mikheev et al.). Although rule-based methods achieve high precision, they consume enormous resources, both human and material, and are highly subjective; with the surge in the number of documents on the Internet and the constant change in requirements, rule-based methods have begun to fall short. At the same time, faster computers and the appearance of large corpora have made statistical methods the mainstream approach to named entity recognition. Hidden Markov models (HMM), maximum entropy models (ME), decision tree methods and transformation-based error-driven learning have all been applied to named entity recognition research. Among them, the conditional random field (CRF) model has achieved clearly better results than the other methods and has received wide attention in recent years, as reflected in many papers, such as "Chinese Segmentation and New Word Detecting using Conditional Random Fields" (author: Fuchun Peng et al.) and "Early results for named entity recognition with conditional random fields" (author: A. McCallum et al.).
The approach most commonly used at present combines rules and statistics, since the strengths and weaknesses of the two are complementary. Whether one extracts a comprehensive set of different features or chooses a supervised, semi-supervised or unsupervised machine learning method, the prerequisite is to analyze the different language forms and text formats and to understand where the difficulties lie. The extensibility of Chinese named entities and the arbitrariness of their word formation, together with the sharing and mutual constraints among the different word classes, make named entity recognition very difficult. The "word" is a fuzzy concept in Chinese and has no clear definition. Even humans encounter boundary ambiguity when reading Chinese, so it is all the more unavoidable in machine processing. The formation rules and structure of Chinese named entities are complex; abbreviations in particular take diverse forms from which composition rules are hard to extract, so a single recognition model cannot be applied to all named entities. In particular, compared with English, Chinese lacks the morphological variation features that play an important role in named entity recognition. Moreover, few large-scale open corpora are so far available for Chinese named entity recognition; researchers mainly rely on the People's Daily corpus of January to June 1998, annotated with the generally accepted Peking University tag set, and on the Traditional and Simplified Chinese corpora released by Microsoft Research Asia.
News comments on the Internet are comments that ordinary readers post, on websites that allow comment posting, about the events, persons, etc. in a news body; they are currently one of the important sources from which people obtain information on the Internet. Many important applications and research topics have arisen from news comment information. One example is public opinion analysis, a hot research problem in natural language processing and information retrieval over the past decade, whose goal is to identify, from a continuous stream of records, topics unknown to the system and the reports related to them; accurate named entity recognition is one of the prerequisite steps of such analysis.
Chinese news comments on the Internet are a medium through which different network users express opinions according to their own wishes and preferences; by commenting on a news item, a commenter changes role from consumer to provider of Internet information. Commenters are largely independent of one another, which means not only that different news items attract very different amounts of attention, but also that the text of the individual comments lacks semantic uniformity. Specifically, news comments have the following main characteristics:
1. Irregular text format. Because news comments come from commenters of all kinds, comment texts often contain many noise fragments, including misspelled characters (written wrongly either deliberately, because the commenter holds a certain sentiment, or unintentionally, because of keyboard operation), unusual use of punctuation, superfluous spaces, characters with no practical meaning, irregular names and abbreviations, etc. Such noise fragments interfere considerably with automated analysis and processing. For example, "Huiyuan" (a beverage brand) may be misspelled as "converging round", and the spaces inserted between the individual Chinese characters of "boycott Coca-Cola" are meaningless.
2. Free and varied writing styles. Commenters differ in knowledge background, level of education, manner of expression and choice of vocabulary, so different commenters may use different words and even different sentence structures to express similar opinions.
3. Relatively concise wording. Online commenters tend to use Internet slang and popular idioms in their comments; such usage may not conform to standard Chinese grammar but is generally understood and accepted by Internet users. In particular, commenters tend toward brief wording and sentence patterns, and many comments consist of only two or three words.
4. Topic relevance. Commenters essentially aim to express feelings or opinions, and news comments are written specifically about the persons or events mentioned in the news body, so there is strong correlation between the body and the comments and among the comments themselves.
5. Lack of research corpora. A large amount of news is published on the Internet every day, followed by massive comment corpora, but these corpora are all raw, unprocessed web pages. So far there has been no related research on the sub-field of named entity recognition in comments, so researchers lack a generally accepted entity annotation standard, and annotated corpora available for research are completely absent.
All of these characteristics lead to relatively large errors in recognizing predefined character data within character data, such as named entities in Chinese news comment data. Such errors cause problems in later network operations such as data retrieval and translation, for example erroneous retrieval results, inaccurate retrieval scope and translation errors. Therefore, how to mine and exploit the characteristics of Chinese news comments effectively, and how to select reasonable features and machine learning methods so as to improve the precision of named entity recognition in comments and achieve better practical results in Internet information extraction, has become a focus and a difficulty in current natural language processing tasks.
Summary of the invention
The present invention aims to provide a method and a device for character data recognition and processing that can solve the above problem of relatively large errors in recognizing predefined character data within character data.
According to one aspect of the present invention, a method for character data recognition and processing is provided, comprising: recognizing feature character data according to a benchmark corpus and a benchmark template to obtain the different entity names corresponding to each named entity; obtaining the feature affix frequency of each entity name; recognizing character data to be processed according to the feature affix frequency, the benchmark template and a predefined corpus to obtain the different entity names corresponding to each named entity; and performing subsequent analysis and processing with the entity names recognized from the character data to be processed as data parameters.
Preferably, the process of recognizing the feature character data according to the benchmark corpus and the benchmark template comprises: processing the benchmark corpus and the benchmark template with a conditional random field (CRF) model tool, and recognizing the feature character data with the benchmark recognition model obtained from the processing.
Preferably, the process of obtaining the feature affix frequency of each entity name comprises: obtaining the feature prefix and the feature suffix corresponding to each entity name, and counting the corresponding feature prefix frequency and feature suffix frequency.
Preferably, the process of recognizing the character data to be processed according to the feature affix frequency, the benchmark template and the predefined corpus comprises: processing the feature affix frequency, the benchmark template and the predefined corpus with the CRF model tool, and recognizing the feature character data with the feature recognition model obtained from the processing.
Preferably, the process of processing the feature affix frequency, the benchmark template and the predefined corpus with the CRF model tool comprises: adding the feature affix frequency as a feature column of the benchmark template to form a feature recognition template; and processing the feature recognition template and the predefined corpus with the CRF tool to obtain the feature recognition model.
Preferably, performing the subsequent analysis and processing comprises: using the entity names recognized by the feature recognition template as keywords for retrieval matching and performing retrieval; or using the entity names recognized by the feature recognition template as keywords for translation matching and performing translation.
According to another aspect of the present invention, a device for character data recognition and processing comprises: a recognition unit, configured to recognize feature character data according to a benchmark corpus and a benchmark template to obtain the different entity names corresponding to each named entity, or to recognize character data to be processed according to a feature affix frequency, the benchmark template and a predefined corpus to obtain the different entity names corresponding to each named entity; a statistics unit, configured to obtain the feature affix frequency of each entity name that the recognition unit recognizes from the feature character data; and a processing unit, configured to perform subsequent analysis and processing with the entity names recognized from the character data to be processed as data parameters.
Preferably, the recognition unit comprises: a conditional random field (CRF) model tool, configured to process the benchmark corpus and the benchmark template and to recognize the feature character data with the benchmark recognition model obtained from the processing, or to process the feature affix frequency, the benchmark template and the predefined corpus and to recognize the feature character data with the feature recognition model obtained from the processing; and an interface module, configured to output to the statistics unit the entity names that the benchmark recognition model recognizes from the feature character data, or to input to the CRF tool the feature affix frequency counted by the statistics unit.
With the method and device of the present invention, feature affixes are added as a recognition feature column during recognition. This overcomes inaccurate recognition and the resulting relatively large errors in recognizing predefined character data during later retrieval and translation, clearly improves the precision of named entity recognition in Chinese news comments, and avoids named entities going unrecognized or being misrecognized because they are expressed freely or not in a sufficiently standard way.
Description of drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of the method for character data recognition and processing according to embodiment one of the present invention;
Fig. 2 is a flowchart of the method for character data recognition and processing according to embodiment two of the present invention;
Fig. 3 is a schematic diagram of the processing procedure of an embodiment of the present invention;
Fig. 4 is a structural diagram of the device for character data recognition and processing according to an embodiment of the present invention.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. When character data is processed, for example translated, retrieved or subjected to a specific analysis, the operation requires a data source of high accuracy and correctness. When named entities are obtained, different acquisition processes may yield different results, which introduces deviations into subsequent processing, such as topic detection and tracking, information retrieval, machine translation and automatic document summarization, and makes its results inaccurate.
This is especially true of named entity recognition in news comments, a new type of text, where many named entities go unrecognized or are misrecognized because they are expressed freely or not in a sufficiently standard way. Named entity recognition plays a basic and important role in public opinion monitoring, sentiment analysis, recognition of positive and negative opinions, etc. for news comments.
Embodiment one.
Referring to Fig. 1, the method for character data recognition and processing according to this embodiment of the present invention mainly comprises the following steps:
S11: recognizing feature character data according to a benchmark corpus and a benchmark template to obtain the different entity names corresponding to each named entity;
S12: obtaining the feature affix frequency of each entity name;
S13: recognizing character data to be processed according to the feature affix frequency, the benchmark template and a predefined corpus to obtain the different entity names corresponding to each named entity;
S14: performing subsequent analysis and processing with the entity names recognized from the character data to be processed as data parameters.
With the above embodiment, each entity name can be recognized accurately. The detailed parameters and the recognized data involved in the recognition process are described in embodiment two.
Embodiment two
Embodiment two of the method of the present invention is set forth below. The present invention can be applied to all kinds of character data, such as symbol data of Chinese or other languages, mathematical symbol data and logical symbol data, which are processed after being recognized in units of words, characters, etc. The embodiment given here takes Chinese news comments as an example: given an input news web page, the news headline, the body and the set of related comments are correctly extracted, the recognition results for person names, place names and organization names in the body and in each comment are fed back, and the corresponding data processing is carried out.
Embodiment two takes web page text data as an example, for instance performing named entity recognition on the news data in web page text data. News data mainly comprises the news headline, news comments, the body, etc.; the main concern here is named entity recognition in the news comments, that is, recognizing person names, place names and organization names. Referring to Fig. 2, the method for character data recognition and processing according to this embodiment of the present invention mainly comprises the following steps:
S21: processing the benchmark corpus and the benchmark template with a conditional random field (CRF) model tool to generate a benchmark recognition model;
The conditional random field model is at present the machine learning algorithm with the best performance for named entity recognition. It was proposed in 2001 by J. Lafferty et al. in "Conditional random fields: Probabilistic models for segmenting and labeling sequence data" and is a statistical method for labeling and segmenting sequence data. A conditional random field is a probabilistic graphical model that can express long-distance dependencies and overlapping features, handles the label bias problem well, and normalizes all features globally, so that a globally optimal solution can be obtained.
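For reference, the global normalization mentioned above can be written out for the standard linear-chain CRF. The notation is the usual textbook one and is not taken from the patent: x is the observed character sequence, y the label sequence, f_k are feature functions with weights λ_k, and Z(x) sums over all candidate label sequences, which is what makes the normalization global rather than per position.

```latex
p(\mathbf{y}\mid\mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\Big)
```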
For named entity recognition in both the body and the related comments, this technique uses the CRF toolkit Pocket CRF 0.45 as the training tool for the CRF model; the word segmentation and part-of-speech tagging preprocessing involved both use the Founder word segmentation toolkit 4.0.
The benchmark corpus consists of the People's Daily annotated corpus of January to June 1998 and the Simplified Chinese corpus of Microsoft Research Asia. The benchmark template may be the predefined benchmark template Template1 based on part of speech and word segmentation, as shown in Table 1.
Table 1: Feature benchmark template Template1 used for named entity recognition
Figure B2009102429754D0000101
Here O denotes the input observation column of the CRF, F denotes its word segmentation feature column, C denotes its part-of-speech feature column, and S denotes the named entity label column.
Word segmentation and part-of-speech tagging are performed as follows: each sentence in the benchmark corpus is processed by the Founder word segmentation toolkit 4.0 to obtain word segmentation and part-of-speech labels. Note that word segmentation, part-of-speech tagging and named entity recognition are all treated here as character labeling problems. Word segmentation uses a four-tag position scheme: word beginning (B), word end (E), word interior (I) and single-character word (S). For example:
A: Real entrepreneurs cannot be produced under state ownership.
B: country/ownership system/under/cannot/produce/real/entrepreneur/.
C: state/B family/E institute/B has/I system/E under/S not/B energy/E product/B life/E true/B just/E enterprise/B industry/I family/E ./S
A is the original sentence, B is the result after word segmentation, and C gives the segmentation label of each character in the sentence.
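The four-tag scheme above can be sketched in a few lines of Python. This is only an illustration, not the toolkit's implementation: the function name `words_to_char_tags` is hypothetical, and the Chinese sentence is a back-translation of example B, so it is an assumption about the original wording.

```python
def words_to_char_tags(words):
    """Convert a segmented sentence into (character, tag) pairs.

    Tags follow the four-tag scheme: B = word beginning, I = word interior,
    E = word end, S = single-character word.
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == last:
                tag = "E"
            else:
                tag = "I"
            tagged.append((ch, tag))
    return tagged


# Assumed back-translation of example B (segmented words, then the full stop).
words = ["国家", "所有制", "下", "不能", "产生", "真正", "企业家", "。"]
char_tags = words_to_char_tags(words)
print(" ".join(f"{ch}/{tag}" for ch, tag in char_tags))
```

The resulting tag sequence has the same B/E/I/S pattern as line C of the example.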
The part-of-speech labels used for part-of-speech tagging are shown in Table 2.
Table 2: Part-of-speech tag set used by the Founder part-of-speech tagger
Figure B2009102429754D0000111
When a word is tagged with a certain part of speech, that part-of-speech label is assigned to every character in the word. An example of the annotation result:
C: country/n ownership system/n under/f cannot/v produce/v real/b entrepreneur/n ./w
D: state/n family/n institute/n has/n system/n under/f not/v energy/v product/v life/v true/b just/b enterprise/n industry/n family/n ./w
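Propagating a word-level part-of-speech label to each character, as in lines C and D, might look like the following sketch. The function name `pos_to_char_tags` is hypothetical, and the Chinese words are again an assumed back-translation of the example sentence.

```python
def pos_to_char_tags(tagged_words):
    """Give every character the part-of-speech label of the word it belongs to."""
    return [(ch, pos) for word, pos in tagged_words for ch in word]


# Word-level tagging corresponding to line C (Chinese text is an assumption).
tagged_words = [
    ("国家", "n"), ("所有制", "n"), ("下", "f"), ("不能", "v"),
    ("产生", "v"), ("真正", "b"), ("企业家", "n"), ("。", "w"),
]
char_pos = pos_to_char_tags(tagged_words)
print(" ".join(f"{ch}/{pos}" for ch, pos in char_pos))
```

Each character-level pair can then serve as one row of the part-of-speech feature column (column C of the benchmark template).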
The above benchmark corpus and the benchmark template Template1 are used as input to the CRF to generate the benchmark recognition model Model1.
S22: recognizing the feature character data with the benchmark recognition model to obtain the different entity names corresponding to each named entity;
In this embodiment, the feature character data are the news headline and body in the news data. A given news web page URL is preprocessed to extract fields such as the news headline, the body and the comment set, and the result is organized into the XML file shown in Fig. 3.
The benchmark recognition model Model1 is applied to the feature character data, that is, the news headline and body fields, to obtain the named entity recognition result, from which the three classes of named entities (person names, place names and organization names) are extracted, giving the different entity names corresponding to each named entity. For example, the person names corresponding to the person name entity: Zhang San, Li Si, ...; the place names corresponding to the place name entity: Beijing, Shanghai, ...; and so on.
S23: obtaining the feature affix frequency of each entity name from the recognized entity names of each type of named entity;
The recognized entity names are saved as a dictionary that maps each named entity to its entity names. Because comments are strongly correlated with the body, the standard named entities in the body provide guidance for named entity recognition in the comments. Although entities in comments are often expressed irregularly, with heavy use of abbreviations, aliases, short forms, etc., they usually retain the first and last characters of the named entity. Take person names as an example: a Chinese person name consists of a surname and a given name, and both surnames and given names draw their characters from a somewhat restricted set. We therefore extract from the dictionary the first character and the last character of each person name entity to form a prefix list and a suffix list respectively, and count how often each prefix and suffix character in the lists occurs in the dictionary.
For example, the affix feature column for person names is composed of four position tags: B, E, A and N. Their meanings are shown in Table 3:
Table 3: Meanings of the prefix/suffix position tags
The prefix/suffix rules for place names and organization names are defined in the same way as for person names. Note that when training with the CRF in the next step, the prefix/suffix features of the three classes of named entities are placed in three separate feature columns, so the same set of position tags can be used for all of them.
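The prefix and suffix statistics of step S23 can be sketched as below. This is a minimal illustration under stated assumptions: the dictionary layout, the type keys `PER` and `LOC`, the function name `affix_frequencies` and the sample names (Zhang San and Li Si written in Chinese, plus one invented name) are all hypothetical, and the mapping of frequencies to the B/E/A/N tags of Table 3 is deliberately left out because the table is not reproduced here.

```python
from collections import Counter


def affix_frequencies(entity_dict):
    """Count first-character (prefix) and last-character (suffix) frequencies.

    entity_dict maps an entity type to the list of entity names recognized
    in the headline and body. Returns {type: (prefix_counter, suffix_counter)}.
    """
    result = {}
    for entity_type, names in entity_dict.items():
        prefixes = Counter(name[0] for name in names if name)
        suffixes = Counter(name[-1] for name in names if name)
        result[entity_type] = (prefixes, suffixes)
    return result


# Hypothetical dictionary built from the recognition result of step S22.
entity_dict = {
    "PER": ["张三", "李四", "张伟"],
    "LOC": ["北京", "上海"],
}
freqs = affix_frequencies(entity_dict)
per_prefixes, per_suffixes = freqs["PER"]
print(per_prefixes.most_common())
print(per_suffixes.most_common())
```

Keeping one pair of counters per entity type matches the remark above that the three classes are placed in three separate feature columns.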
S24: processing the feature affix frequencies, the benchmark template and the predefined corpus with the CRF tool to generate the feature recognition template;
The predefined corpus can be selected according to the user's needs. Taking the named entity recognition of this embodiment as an example, the predefined training corpus is selected from Sina hot news of October 2008, fifty news items in total, evenly distributed over five major domains: politics, economy, sports, entertainment, and science and technology. After irrelevant noise comments such as advertisements are removed, the 100 most recent comments associated with each news item are selected to form the annotation set. For the annotation itself, on top of the previously accepted basic definitions of person names, place names and organization names, the following basic annotation rules were added according to the expressive characteristics of news comments:
(A) The annotation task must annotate the named entities in both the news body and the comment content. If a named entity in a comment also appears in the body but is expressed differently in the comment than in the body, the corresponding named entity in the body must be indicated. If the body itself expresses the same entity in several different ways, the named entity in the comment points to the most standard form used in the original text.
For example: 1) "Facts prove once again that we simply cannot yield to 'mould state'": "mould state" corresponds to "the United States" in the original text.
2) "If someone had said last year that Yi would challenge for the best shooting percentage on the team, I would certainly have been full of hope": "Yi" corresponds to "Yi Jianlian" in the original text.
(B) Annotation should be based on an understanding of the comment and must indicate the actual class of each named entity. For example, "in the match between the US team and China" should be annotated as "in the match between [O US team] and [O China]": according to the semantics of the sentence, "China" actually denotes the Chinese team and is therefore annotated as an organization name.
(C) Named entities that could be annotated either as place names or as organization names are annotated according to the following principle: if, in context, the named entity refers to a geographic location in the spatial sense, it is annotated as a place name; otherwise it is annotated as an organization name. For example: 1) "The next stop after Haidian Huangzhuang is Peking University": here "Peking University" refers to a specific location and is therefore annotated as a place name.
2) "The goal of Peking University is to become a world-class university": here "Peking University" does not refer to a specific location but to a group, and is therefore annotated as an organization name.
(D) When several place names/organization names are nested, they are annotated separately in principle. If the last place name/organization name cannot form a named entity on its own, it is merged with the place name/organization name immediately before it and the two are annotated as a single named entity; otherwise each nested place name/organization name is annotated as a separate named entity.
For example: 1) "Peking University, Haidian District, Beijing City" is annotated as /[L Beijing]/[L Haidian District]/[O Peking University]/
2) "Haidian District Education Commission, Beijing City" is annotated as /[L Beijing]/[O Haidian District Education Commission]
In these examples, "Education Commission" cannot form a named entity on its own, whereas "Peking University" can.
(E) For person-name annotation, note the following two points:
1) In expressions of the form surname + title, the title is never annotated as part of the person name. For example, "Uncle Zhang" is annotated as [P Zhang] uncle, and "Teacher Wang" is annotated as [P Wang] teacher.
2) For entities that may be ambiguous, the annotation must include the qualifier that distinguishes the entity. For example, "Bush Jr." and "old Bush" must be annotated as [P Bush Jr.] and [P old Bush].
Finally, the annotation work was divided between two researchers in the field of natural language processing, who resolved disagreements by discussion and cross-checked each other's work.
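The bracketed notation used in the examples above (a tag letter followed by the entity text) is easy to read back into structured form. The following is a minimal sketch, not part of the described method: the function name `parse_annotations` and the mapping of P/L/O to readable labels are assumptions made for illustration.

```python
import re

# Tag letters as used in the annotation examples:
# P = person name, L = place name, O = organization name.
TAG_NAMES = {"P": "person", "L": "place", "O": "organization"}

# Matches one bracketed annotation such as "[L Haidian District]".
ANNOTATION = re.compile(r"\[([PLO]) ([^\[\]]+)\]")


def parse_annotations(text):
    """Return (entity_type, surface_text) pairs in order of appearance."""
    return [
        (TAG_NAMES[tag], surface.strip())
        for tag, surface in ANNOTATION.findall(text)
    ]


if __name__ == "__main__":
    print(parse_annotations("/[L Beijing]/[L Haidian District]/[O Peking University]/"))
    print(parse_annotations("/[L Beijing]/[O Haidian District Education Commission]"))
```

Text outside the brackets is ignored, so an annotated sentence can be passed in as a whole.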
The selection of the predefined corpus has been described above. In the generation process, the CRF tool processes the feature affix (prefix and suffix) frequencies, the benchmark template, and the predefined corpus to generate the feature recognition template Template2.
Template2 keeps all the template features of Template1 and adds templates for the prefix/suffix feature columns of person names, place names, and organization names. The resulting template is shown in Table 4, where PER denotes the prefix/suffix feature column for person names, LOC the prefix/suffix feature column for place names, and ORG the prefix/suffix feature column for organization names.
Table 4 Template2
Figure B2009102429754D0000151
Figure B2009102429754D0000161
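As an illustration of how prefix and suffix frequencies of recognized entity names might be counted and turned into an extra feature column, here is a minimal sketch. It rests on assumptions that the text does not state: an affix is taken to be the first or last character of an entity name, and the column value is a simple Y/N flag with a hypothetical threshold `min_count`. The function names are likewise hypothetical.

```python
from collections import Counter


def affix_frequencies(entity_names, affix_len=1):
    """Count how often each leading and trailing substring occurs.

    Assumption: an affix is the first/last `affix_len` characters of a name.
    """
    prefix_counts = Counter()
    suffix_counts = Counter()
    for name in entity_names:
        if len(name) < affix_len:
            continue  # too short to carry an affix of this length
        prefix_counts[name[:affix_len]] += 1
        suffix_counts[name[-affix_len:]] += 1
    return prefix_counts, suffix_counts


def affix_feature_column(token, counts, min_count=2):
    """Return 'Y' if the token is a frequent affix, else 'N' (hypothetical encoding)."""
    return "Y" if counts[token] >= min_count else "N"


if __name__ == "__main__":
    # Toy place names: two end in 市 (city), one in 区 (district).
    place_names = ["北京市", "上海市", "海淀区"]
    prefixes, suffixes = affix_frequencies(place_names)
    print(suffixes.most_common())
    print(affix_feature_column("市", suffixes), affix_feature_column("区", suffixes))
```

A column of this kind could then be appended to each token row of the training data, one column per entity class.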
S25: Use the feature recognition template to recognize the character data to be processed, obtaining the entity name corresponding to each named entity.
Because the target is news comments, the news comments are used as the character data to be processed, and the entity names of the person-name, place-name, and organization-name entities in them are identified. Because the feature affix frequencies have been added, they provide further evidence for deciding what an identified entity is, so entity names are recognized more accurately.
To verify the effectiveness of the recognition process, the annotated news comment corpus was divided into five parts and a five-fold cross-validation (5-fold) experiment was carried out. Each training set consists of 8×5 articles, 8 from each of the five categories (economy, politics, sports, entertainment, and science and technology); the remaining 2×5 articles form the test set.
Named entity recognition research is generally evaluated with two metrics, precision and recall; this is also the evaluation method used by the MUC (Message Understanding Conference).
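The split described here, five folds with every domain equally represented in each, can be sketched as follows. This is an illustrative sketch only: the article identifiers are placeholders, and the function name `five_fold_splits` is hypothetical.

```python
DOMAINS = ["politics", "economy", "sports", "entertainment", "technology"]


def five_fold_splits(articles_by_domain, n_folds=5):
    """Yield (train, test) lists; each fold holds out an equal share of every domain.

    If a domain's size is not divisible by n_folds, the leftover articles
    always stay in the training set.
    """
    for fold in range(n_folds):
        train, test = [], []
        for articles in articles_by_domain.values():
            share = len(articles) // n_folds
            held_out = articles[fold * share:(fold + 1) * share]
            test.extend(held_out)
            train.extend(a for a in articles if a not in held_out)
        yield train, test


if __name__ == "__main__":
    # Ten placeholder articles per domain, fifty in total.
    corpus = {d: [f"{d}-{i}" for i in range(10)] for d in DOMAINS}
    for train, test in five_fold_splits(corpus):
        print(len(train), len(test))  # 40 10 for every fold
```

With ten articles per domain, each fold trains on eight and tests on two articles per domain, which matches the 8×5 and 2×5 figures in the text.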
Figure B2009102429754D0000162
Figure B2009102429754D0000163
To evaluate the overall performance of a system, the F value (F-Measure), the weighted harmonic mean of precision and recall, is usually also computed. The formula is as follows:
$$\mathrm{F\text{-}Measure} = \frac{(\beta^{2} + 1.0) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}} \times 100\%$$
where, in general, $\beta = 1$.
The comparison results are shown in Table 5, Table 6, and Table 7:
Table 5 Experimental results for person-name recognition

| | Recall | Precision | F value |
|---|---|---|---|
| Without prefix/suffix features | 69.35% | 80.44% | 74.49% |
| With prefix/suffix features | 78.39% | 85.65% | 81.86% |
Table 6 Experimental results for place-name recognition

| | Recall | Precision | F value |
|---|---|---|---|
| Without prefix/suffix features | 90.95% | 89.90% | 90.42% |
| With prefix/suffix features | 91.54% | 91.44% | 91.49% |
Table 7 Experimental results for organization-name recognition

| | Recall | Precision | F value |
|---|---|---|---|
| Without prefix/suffix features | 50.44% | 76.34% | 60.74% |
| With prefix/suffix features | 59.85% | 78.30% | 67.84% |
Tables 5, 6, and 7 show that, compared with not using them, using the prefix/suffix feature columns provided by the named entity dictionary built from the news body improves recall, precision, and F value in all three entity classes. The gain is largest for person names and organization names and smaller for place names, which supports the advantage of the proposed algorithm.
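The F values reported in Tables 5 to 7 can be checked against the reported recall and precision. The sketch below assumes the usual count-based definitions of precision and recall (the formulas themselves appear only as images in this text) and an F-measure whose denominator is beta² × Precision + Recall. With beta = 1, the recomputed values agree with the reported ones to within about 0.01 percentage points, the difference coming from rounding of the inputs.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (both as fractions)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1.0) * precision * recall / denominator


def precision_recall(correct, predicted, gold):
    """Usual definitions (an assumption here): correct/predicted and correct/gold."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    return precision, recall


# (recall %, precision %, reported F %) rows from Tables 5, 6, and 7.
REPORTED = [
    (69.35, 80.44, 74.49), (78.39, 85.65, 81.86),   # person names
    (90.95, 89.90, 90.42), (91.54, 91.44, 91.49),   # place names
    (50.44, 76.34, 60.74), (59.85, 78.30, 67.84),   # organization names
]

if __name__ == "__main__":
    for recall, precision, reported in REPORTED:
        computed = 100 * f_measure(precision / 100, recall / 100)
        print(f"reported {reported:.2f}  recomputed {computed:.2f}")
```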
S26: Use the entity names identified by the feature recognition template as data parameters to perform subsequent analysis processing.
In subsequent analysis processing, the identified entity names can be used as data parameters, for example as matching parameters for search keywords or as translation parameters during translation, so that the specific analysis operations are more accurate. In particular, with this named entity recognition method for news comments, named entities that previously went unrecognized or were misrecognized because they were written freely or in a non-standard way can be identified correctly. This plays a fundamental and important role in public opinion monitoring, sentiment analysis, and the identification of positive and negative viewpoints for news comments.
Embodiment three
Fig. 4 shows the structure of the device of the present invention. As shown in Fig. 4, the device for character data recognition and processing according to an embodiment of the invention comprises:
1) a recognition unit 40, configured to recognize feature character data according to a benchmark corpus and a benchmark template, obtaining the entity name corresponding to each named entity; or to recognize the character data to be processed according to the feature affix frequencies, the benchmark template, and a predefined corpus, obtaining the entity name corresponding to each named entity;
2) a statistics unit 41, configured to obtain the feature affix frequency of each entity name that the recognition unit 40 identified from the feature character data;
3) a processing unit 42, configured to use the entity names identified from the character data to be processed as data parameters for subsequent analysis processing.
Preferably, the recognition unit 40 comprises:
1) a conditional random field (CRF) tool, which processes the benchmark corpus and the benchmark template and recognizes the feature character data with the resulting benchmark recognition model; or which processes the feature affix frequencies, the benchmark template, and the predefined corpus and recognizes the feature character data with the resulting feature recognition model;
2) an interface module, configured to output to the statistics unit the entity names that the benchmark recognition model identified from the feature character data; or to input to the CRF tool the feature affix frequencies counted by the statistics unit.
The device for character data recognition and processing according to the embodiment of the invention can use the methods of Embodiments 1 and 2 above to recognize and process character data, so the processing procedure of the device is not repeated here.
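The way the three units hand data to one another can be sketched as below. This is only an architectural illustration under stated assumptions: the two recognizers are toy stand-ins for the trained CRF models, affixes are taken to be single leading or trailing characters, and all class, method, and function names are hypothetical.

```python
from collections import Counter


class StatisticsUnit:
    """Unit 41: counts prefix and suffix frequencies of recognized entity names."""

    def affix_frequencies(self, names):
        return {
            "prefix": Counter(n[0] for n in names if n),
            "suffix": Counter(n[-1] for n in names if n),
        }


class RecognitionUnit:
    """Unit 40: wraps two recognizers that stand in for the CRF models."""

    def __init__(self, baseline, enhanced):
        self.baseline = baseline  # callable: text -> list of entity names
        self.enhanced = enhanced  # callable: (text, affix_freq) -> list of entity names

    def recognize_feature_data(self, text):
        return self.baseline(text)

    def recognize_pending_data(self, text, affix_freq):
        return self.enhanced(text, affix_freq)


class ProcessingUnit:
    """Unit 42: passes recognized names on, here as de-duplicated search keywords."""

    def to_keywords(self, names):
        return list(dict.fromkeys(names))  # drop duplicates, keep first-seen order


def run_device(recognition, statistics, processing, feature_text, pending_text):
    names = recognition.recognize_feature_data(feature_text)
    affix_freq = statistics.affix_frequencies(names)
    found = recognition.recognize_pending_data(pending_text, affix_freq)
    return processing.to_keywords(found)


# Toy stand-ins for the trained models (assumptions, not the described CRF tool).
KNOWN = ["Beijing", "Peking University"]


def toy_baseline(text):
    return [name for name in KNOWN if name in text]


def toy_enhanced(text, affix_freq):
    # Accept any word whose first character was seen as an entity prefix.
    return [w for w in text.split() if w[:1] in affix_freq["prefix"]]


if __name__ == "__main__":
    device = RecognitionUnit(toy_baseline, toy_enhanced)
    print(run_device(device, StatisticsUnit(), ProcessingUnit(),
                     "Peking University is in Beijing",
                     "Visiting Peking and Beijing again Beijing"))
```

Keeping the recognizers as injected callables mirrors the role of the interface module: the statistics unit never needs to know how the models were trained.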
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A method for recognizing and processing character data, characterized by comprising:
recognizing feature character data according to a benchmark corpus and a benchmark template, to obtain the entity name corresponding to each named entity;
obtaining the feature affix frequency of each entity name;
recognizing character data to be processed according to the feature affix frequency, the benchmark template, and a predefined corpus, to obtain the entity name corresponding to each named entity;
using the entity names identified from the character data to be processed as data parameters to perform subsequent analysis processing.
2. The method according to claim 1, characterized in that the process of recognizing the feature character data according to the benchmark corpus and the benchmark template comprises:
processing the benchmark corpus and the benchmark template with a conditional random field (CRF) tool, and recognizing the feature character data with the benchmark recognition model obtained from the processing.
3. The method according to claim 1 or 2, characterized in that the process of obtaining the feature affix frequency of each entity name comprises:
obtaining the feature prefix and feature suffix corresponding to each entity name, and counting the corresponding feature prefix frequency and feature suffix frequency.
4. The method according to claim 3, characterized in that the process of recognizing the character data to be processed according to the feature affix frequency, the benchmark template, and the predefined corpus comprises:
processing the feature affix frequency, the benchmark template, and the predefined corpus with a conditional random field (CRF) tool, and recognizing the feature character data with the feature recognition model obtained from the processing.
5. The method according to claim 4, characterized in that the process of processing the feature affix frequency, the benchmark template, and the predefined corpus with the conditional random field (CRF) tool comprises:
adding the feature affix frequency as a feature column of the benchmark template to form the feature recognition template;
processing the feature recognition template and the predefined corpus with the CRF tool to obtain the feature recognition model.
6. The method according to claim 1, characterized in that performing the subsequent analysis processing comprises:
using the entity names identified by the feature recognition template as keywords for retrieval matching, and performing retrieval processing; or
using the entity names identified by the feature recognition template as keywords for translation matching, and performing translation processing.
7. A device for recognizing and processing character data, characterized by comprising:
a recognition unit, configured to recognize feature character data according to a benchmark corpus and a benchmark template, obtaining the entity name corresponding to each named entity; or to recognize character data to be processed according to the feature affix frequency, the benchmark template, and a predefined corpus, obtaining the entity name corresponding to each named entity;
a statistics unit, configured to obtain the feature affix frequency of each entity name that the recognition unit identified from the feature character data;
a processing unit, configured to use the entity names identified from the character data to be processed as data parameters for subsequent analysis processing.
8. The device according to claim 7, characterized in that the recognition unit comprises:
a conditional random field (CRF) tool, which processes the benchmark corpus and the benchmark template and recognizes the feature character data with the resulting benchmark recognition model; or which processes the feature affix frequency, the benchmark template, and the predefined corpus and recognizes the feature character data with the resulting feature recognition model;
an interface module, configured to output to the statistics unit the entity names that the benchmark recognition model identified from the feature character data; or to input to the CRF tool the feature affix frequency counted by the statistics unit.
CN2009102429754A 2009-12-22 2009-12-22 Character data recognition and processing method and device Pending CN102103594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102429754A CN102103594A (en) 2009-12-22 2009-12-22 Character data recognition and processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102429754A CN102103594A (en) 2009-12-22 2009-12-22 Character data recognition and processing method and device

Publications (1)

Publication Number Publication Date
CN102103594A true CN102103594A (en) 2011-06-22

Family

ID=44156371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102429754A Pending CN102103594A (en) 2009-12-22 2009-12-22 Character data recognition and processing method and device

Country Status (1)

Country Link
CN (1) CN102103594A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110622