CN103995885B - Method and device for recognizing entity names - Google Patents

Method and device for recognizing entity names

Publication number
CN103995885B
Authority
CN
China
Prior art keywords
root
identified
text
name
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410234622.0A
Other languages
Chinese (zh)
Other versions
CN103995885A (en)
Inventor
陈丽欧
徐明泉
韩锋
姜世超
周寰
王平
雷绍泽
周丰乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410234622.0A priority Critical patent/CN103995885B/en
Publication of CN103995885A publication Critical patent/CN103995885A/en
Application granted granted Critical
Publication of CN103995885B publication Critical patent/CN103995885B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method and a device for recognizing entity names. The method comprises: obtaining a text to be recognized and source information of the text; obtaining first entity names in the text according to the source information and a recognition model; and obtaining second entity names from the content of the text other than the first entity names, according to a pre-established root table and preset constraint rules. The method improves the precision and recall of entity name recognition and is applicable to various languages, so it is more general. In addition, effectively recognizing entity names in creative text satisfies the personalization needs of creatives.

Description

Method and device for recognizing entity names
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and a device for recognizing entity names.
Background
With the rapid development and widespread use of computers and the Internet, Internet resources have grown steadily richer and the volume of information has increased sharply. To let users quickly find the information they actually need among massive information sources, information documents must be processed so that the entity names in them are recognized automatically and users can look up information by entity name. Automatic recognition of entity names remains a technical challenge. Different types of entity names differ in recognition difficulty and call for different recognition methods. There are two main approaches: statistical learning and rule-based recognition.
Statistical learning comprises a training stage and a recognition stage. In the training stage, a proper-name recognition model is trained on an annotated corpus by extracting relevant features and choosing a suitable machine learning strategy. In the recognition stage, the trained model automatically recognizes proper names in new text. However, the training corpus must be annotated and proofread manually, which is very time-consuming and laborious. Moreover, entity names change constantly and new ones appear frequently, so the training corpus must be updated often. This consumes considerable human resources, and the accuracy is still not high.
In rule-based recognition, people write the linguistic knowledge used to recognize entity names as a set of rules, and a machine applies these rules to recognize entity names in text automatically. The rules generally depend on a specific language, such as Chinese or English. However, the rules are numerous and complex, and there is as yet no unified guidance for the knowledge-encoding work. Rule-based methods therefore require separate recognition rules for each language, which means a heavy workload and poor generality.
In summary, existing methods for recognizing entity names generalize poorly, require heavy preparation, and struggle to achieve high accuracy and low labor cost at the same time.
Summary of the invention
The present invention aims to solve the above technical problems at least to some extent.
To this end, a first object of the present invention is to provide a method for recognizing entity names that improves the accuracy and generality of entity name recognition.
A second object of the present invention is to provide a device for recognizing entity names.
To achieve the above objects, an embodiment of the first aspect of the present invention provides a method for recognizing entity names, comprising: obtaining a text to be recognized and source information of the text to be recognized; obtaining first entity names in the text to be recognized according to the source information of the text and a recognition model; and obtaining second entity names from the content of the text other than the first entity names, according to a pre-established root table and preset constraint rules.
In the method of this embodiment, the first entity names in the text to be recognized are obtained according to the source information of the text and the recognition model, and the second entity names are obtained according to the root table and the preset rules. This fully combines the advantages of statistical learning and rule-based recognition, improves the precision and recall of entity name recognition, and is applicable to various languages, so it is more general. In addition, effectively recognizing entity names in creative text satisfies the personalization needs of creatives and meets the need to recognize legally risky vocabulary.
An embodiment of the second aspect of the present invention provides a device for recognizing entity names, comprising: an acquisition module, configured to obtain a text to be recognized and source information of the text to be recognized; a first recognition module, configured to obtain first entity names in the text according to the source information of the text and a recognition model; and a second recognition module, configured to obtain second entity names from the content of the text other than the first entity names, according to a pre-established root table and preset constraint rules.
In the device of this embodiment, the first entity names in the text to be recognized are obtained according to the source information of the text and the recognition model, and the second entity names are obtained according to the root table and the preset rules. This fully combines the advantages of statistical learning and rule-based recognition, improves the precision and recall of entity name recognition, and is applicable to various languages, so it is more general. In addition, effectively recognizing entity names in creative text satisfies the personalization needs of creatives and meets the need to recognize legally risky vocabulary.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for recognizing entity names according to one embodiment of the present invention;
Fig. 2 is a flowchart of a method for obtaining the first entity names in the text to be recognized according to the source information of the text and the recognition model, according to one embodiment of the present invention;
Fig. 3 is a flowchart of obtaining second entity names from the content of the text to be recognized other than the first entity names, according to the pre-established root table and preset constraint rules, according to one embodiment of the present invention;
Fig. 4 is a flowchart of a method for establishing the root table and the affix table according to one embodiment of the present invention;
Fig. 5 is a flowchart of a method for establishing the root recognition model according to one embodiment of the present invention;
Fig. 6 is a flowchart of a method for establishing the entity recognition model according to one embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a device for recognizing entity names according to one embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a device for recognizing entity names according to another embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it.
In the description of the present invention, it should be understood that the term "multiple" means two or more, and that the terms "first" and "second" are used for description only and are not to be understood as indicating or implying relative importance.
The method and device for recognizing entity names according to embodiments of the present invention are described below with reference to the drawings.
To reduce the human resources consumed in recognizing entity names and to improve recognition accuracy, the present invention provides a method for recognizing entity names, comprising: obtaining a text to be recognized and source information of the text; obtaining first entity names in the text according to the source information of the text; and obtaining second entity names from the content of the text other than the first entity names, according to a pre-established root table and preset constraint rules.
In embodiments of the present invention, an entity name is the name of any distinguishable and identifiable thing in the real world, for example an organization name, a brand name, a place name or a person's name.
Fig. 1 is a flowchart of a method for recognizing entity names according to one embodiment of the present invention. As shown in Fig. 1, the method according to this embodiment comprises:
S101, obtain a text to be recognized and source information of the text to be recognized.
In one embodiment of the present invention, the source information of the text to be recognized is the company name, website name or the like of the party that published the text, for example "Shenzhen Lianxunda Electronic Technology Development Co., Ltd.".
In embodiments of the present invention, the text to be recognized is natural-language text. The source information may be supplied by the user together with the text, or it may be obtained from the publication information recorded when the text was published, such as the publisher's account information, because the publisher's account information usually includes the organization that the publisher belongs to or represents.
S102, obtain the first entity names in the text to be recognized according to the source information of the text and a recognition model.
In embodiments of the present invention, a first entity name is an entity name related to the source information of the text to be recognized. For example, in one embodiment the first entity name may be an organization name: if the source information is "Shenzhen Lianxunda Electronic Technology Development Co., Ltd.", the first entity name may be "Lianxunda Electronic Technology Development Co., Ltd.".
Specifically, in one embodiment of the present invention, the first entity names in the text to be recognized may be obtained through the steps shown in Fig. 2. As shown in Fig. 2, the method of obtaining the first entity names according to the source information of the text and the recognition model comprises:
S201, recognize the source information of the text to be recognized with a root recognition model, to obtain the root in the source information.
In embodiments of the present invention, the root recognition model is established in advance. More specifically, it may be trained before the text is recognized, or a trained root recognition model may be copied or downloaded from another storage device. The root recognition model is trained according to the root table and can recognize roots in the source information of the text to be recognized. For example, for the source information "Shenzhen Lianxunda Electronic Technology Development Co., Ltd.", the root recognition model can recognize the root "Lianxunda".
S202, obtain the first entity names in the text to be recognized according to the root and a pre-established affix table.
In embodiments of the present invention, the affix table is a table that stores the suffixes of first entity names. For example, the affix table may include entity name suffixes such as "Co., Ltd." and "electromechanical accessory factory".
In one embodiment of the present invention, a first entity name may carry a suffix, such as "Lianxunda Co., Ltd.", or carry none, such as "Lianxunda". Therefore, the root is first searched for in the text to be recognized; if it is present, the root is itself a first entity name in the text. Then, according to the root and the affix table, the text is searched for strings formed by the root combined with any affix in the affix table, and these strings are also taken as first entity names.
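The lookup in S202 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_first_entity_names`, the sample affix list and the sample sentence are assumptions, and the space-skipping step is only there so that the example works on English text.

```python
def find_first_entity_names(text, root, affix_table):
    """Return each occurrence of `root` in `text`, extended by an affix when one follows.

    Hypothetical helper: the patent does not prescribe this exact procedure.
    """
    if not root:
        return []
    names = []
    pos = text.find(root)
    while pos != -1:
        end = pos + len(root)
        # Skip spaces between root and affix (unnecessary for unsegmented Chinese text).
        nxt = end
        while nxt < len(text) and text[nxt] == " ":
            nxt += 1
        # Keep the longest affix that starts right after the root, if any.
        following = [a for a in affix_table if text.startswith(a, nxt)]
        if following:
            end = nxt + len(max(following, key=len))
        names.append(text[pos:end])
        pos = text.find(root, end)
    return names


sample = "Choose Lianxunda Co., Ltd. for cables; Lianxunda ships fast."
print(find_first_entity_names(sample, "Lianxunda", ["Co., Ltd.", "Hospital"]))
```

The first occurrence is returned with its suffix and the second as the bare root, matching the two forms of first entity name described above.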
In another embodiment of the present invention, because many entities have aliases, the root obtained from the source information may not cover every entity name in the text to be recognized; for example, "Fanke" (凡客) may also be written "VANCL". To recognize the entity names in the text comprehensively, the method of obtaining the first entity names according to the source information of the text may further include, in addition to steps S201-S202:
S203, recognize the text to be recognized with an entity recognition model, to obtain first entity names in the text.
In embodiments of the present invention, the entity recognition model is established in advance. More specifically, it may be trained before the text is recognized, or a trained entity recognition model may be copied or downloaded from another storage device. The entity recognition model is trained according to the root table and the affix table and can recognize entities in the text to be recognized. For example, "VANCL 诚品" in the text to be recognized can be recognized as a first entity name by the entity recognition model.
S103, obtain second entity names from the content of the text to be recognized other than the first entity names, according to the pre-established root table and preset constraint rules.
In embodiments of the present invention, a second entity name is an entity name related to the agency, products or business operations of the first entity name. For example, if the first entity name is an organization name, the second entity name may be a brand name. Specifically, the second entity names in the text may be recognized by the method shown in Fig. 3. As shown in Fig. 3, obtaining second entity names from the content other than the first entity names according to the pre-established root table and preset constraint rules comprises:
S301, look up, in the pre-established root table, the roots contained in the content of the text to be recognized other than the first entity names.
S302, screen those roots according to the preset constraint rules, to obtain the second entity names in the content of the text other than the first entity names.
In one embodiment of the present invention, the roots in the root table are divided into strong-constraint roots and weak-constraint roots. A strong-constraint root can serve as an entity name in any context, whereas a weak-constraint root can serve as an entity name only when certain contextual constraints are met. For example, "Fanke" (凡客) is a strong-constraint root, whereas "seven days" is an entity name only when combined with an affix such as "hotel" or "holiday inn"; in other contexts "seven days" is merely a numeral-classifier phrase. Preset constraint rules must therefore be established for weak-constraint roots. The rules restrict the context of a weak-constraint root so that it counts as an entity name only when the rule is satisfied. Because weak-constraint roots are of different types, the preset constraint rules are matched to each weak-constraint root, and the present invention does not limit their specific form.
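The screening in S302 might look like the sketch below. Since the specific form of the constraint rules is left open, everything here is an assumption: the rule format (a weak root mapped to the affixes that must follow it), the names `STRONG_ROOTS`, `WEAK_ROOT_RULES` and `screen_roots`, and the lower-casing used for matching.

```python
# Hypothetical tables: strong roots always count, weak roots need a listed affix.
STRONG_ROOTS = {"vancl"}
WEAK_ROOT_RULES = {"seven days": ("hotel", "holiday inn")}


def screen_roots(text):
    """Return entity names found in `text` under the illustrative rules above."""
    lowered = text.lower()
    # A strong-constraint root is accepted wherever it appears.
    names = [root for root in sorted(STRONG_ROOTS) if root in lowered]
    # A weak-constraint root is accepted only together with one of its affixes.
    for root, affixes in WEAK_ROOT_RULES.items():
        names += [f"{root} {affix}" for affix in affixes
                  if f"{root} {affix}" in lowered]
    return names


print(screen_roots("Stay at Seven Days Hotel tonight"))
print(screen_roots("Delivery takes seven days"))
```

In the second call "seven days" is not followed by a listed affix, so it is treated as an ordinary phrase and nothing is returned.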
In the method of this embodiment, the first entity names in the text to be recognized are obtained according to the source information of the text and the recognition model, and the second entity names are obtained according to the root table and the preset rules. This fully combines the advantages of statistical learning and rule-based recognition, improves the precision and recall of entity name recognition, and is applicable to various languages, so it is more general. In addition, effectively recognizing entity names in creative text satisfies the personalization needs of creatives and meets the need to recognize legally risky vocabulary.
In one embodiment of the present invention, after the entity names are recognized, each is marked with a tag corresponding to its type. For example, the tag for an organization name is <ORG></ORG> and the tag for a brand name is <BRD></BRD>. For example, if "Shenzhen Lianxunda Electronic Technology Development Co., Ltd." is a company name, the entity names in a creative it publishes are tagged as follows:
Creative: …<BRD>Nike is gloomy</BRD> network cable - preferred Shenzhen <ORG>Lianxunda</ORG>…
Here "Lianxunda" is an organization name, and "Nike is gloomy" is the name of a product it deals in and should be recognized as a brand name.
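The tagging step can be illustrated with a short sketch. The function `tag_entities` and the brand name "Acme" are hypothetical; only the <ORG> and <BRD> tag names come from the text above.

```python
import re


def tag_entities(text, labels):
    """Wrap every recognized name in its type tag.

    `labels` maps an entity name to a tag such as "ORG" or "BRD".
    """
    if not labels:
        return text
    # Try longer names first so a short name never splits a longer one.
    names = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match):
        name = match.group(0)
        return f"<{labels[name]}>{name}</{labels[name]}>"

    return pattern.sub(wrap, text)


print(tag_entities("Acme cable, preferred by Lianxunda",
                   {"Acme": "BRD", "Lianxunda": "ORG"}))
```

A single regular-expression pass is used so that text inserted by one replacement is never scanned again by another.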
Fig. 4 is a flowchart of a method for establishing the root table and the affix table according to one embodiment of the present invention. Specifically, as shown in Fig. 4, the method comprises:
S401, collect multiple registered entity names.
In embodiments of the present invention, a registered entity name is an established entity name, such as a registered company name, product name or registered brand.
S402, segment each of the registered entity names into words, to obtain multiple segmented words.
Any word segmentation method in the related art, or any that may appear in the future, may be used to segment the registered entity names; the present invention does not limit the segmentation method.
S403, obtain the attribute features of the segmented words.
In embodiments of the present invention, the attribute features of a segmented word include its part of speech, its length, its frequency across all registered entity names, and its position within a registered entity name.
S404, according to the attribute features, select from the segmented words the roots for the root table and the affixes for the affix table, thereby establishing the root table and the affix table.
In embodiments of the present invention, roots have attribute features such as a high frequency of occurrence and a typical position between a region word and a product word, while affixes have attribute features such as a high frequency and a typical position at the end of a company name. Roots and affixes can therefore be selected from the segmented words according to their respective attribute features.
For example, roots may be selected from the segmented words by the following rules:
A, the characters that make up the word are not separated by other words;
B, the word is not a region word;
C, the product of the word's frequency and position must satisfy a certain threshold;
D, the total length of the word must be less than a certain length threshold.
Affixes may be selected from the segmented words by the following rules:
A, the word is at the end of a company name (or at the end of a recursive structure);
B, the word's frequency of occurrence must exceed a certain frequency threshold;
C, the characters that make up the word must satisfy certain part-of-speech restrictions.
It should be understood that the above rules are only exemplary. In other embodiments of the present invention, those skilled in the art may set the screening rules for roots and affixes according to other attribute features of roots and affixes that are not listed above.
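A simplified version of this screening is sketched below. It applies only part of the rules (affix rules A and B, root rules B and D) and leaves out the frequency-times-position and part-of-speech checks, whose thresholds are not specified. The function `build_tables`, its threshold defaults and the sample names are all assumptions.

```python
from collections import Counter


def build_tables(segmented_names, region_words, min_affix_freq=2, max_root_len=12):
    """Split segmented registered names into a root set and an affix set.

    Illustrative only: thresholds are hypothetical and several rules are omitted.
    """
    # Affix rules A and B: the word ends a name and does so often enough.
    tail_counts = Counter(tokens[-1] for tokens in segmented_names if tokens)
    affixes = {w for w, count in tail_counts.items() if count >= min_affix_freq}
    # Root rules B and D: not a region word, and short enough.
    roots = set()
    for tokens in segmented_names:
        for token in tokens[:-1]:
            if (token not in region_words and token not in affixes
                    and len(token) <= max_root_len):
                roots.add(token)
    return roots, affixes


names = [
    ["Shenzhen", "Lianxunda", "Company"],
    ["Beijing", "Dawn", "Hospital"],
    ["Shanghai", "Sunrise", "Hospital"],
    ["Beijing", "Bright", "Company"],
]
roots, affixes = build_tables(names, {"Shenzhen", "Beijing", "Shanghai"})
print(sorted(roots), sorted(affixes))
```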
In one embodiment of the present invention, because entity names are extremely varied, the root table is very large. To speed up lookups in the root table, a compressed index is built over it; for example, roots that share a prefix can share a common index built on that prefix, which improves query efficiency. In addition, as in the previous embodiment, roots are divided into strong-constraint roots and weak-constraint roots, so the root table can be split into a strong root table and a weak root table.
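One common way to share an index among roots with the same prefix is a trie. The sketch below assumes that structure; the text does not say which index is actually used, and the class name `RootIndex` and its methods are hypothetical.

```python
_END = ""  # marker key; never collides with a single-character key


class RootIndex:
    """Prefix-sharing index (trie) over a root table."""

    def __init__(self, roots=()):
        self._trie = {}
        for root in roots:
            self.add(root)

    def add(self, root):
        node = self._trie
        for ch in root:
            node = node.setdefault(ch, {})  # shared prefixes reuse the same nodes
        node[_END] = True

    def longest_match(self, text, start=0):
        """Return the longest root that begins at text[start], or None."""
        node, best = self._trie, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if _END in node:
                best = text[start:i + 1]
        return best


index = RootIndex(["seven", "seven days", "several"])
print(index.longest_match("seven days hotel"))
```

All three roots share the nodes for "seve", so a lookup walks the text once instead of comparing it against every root.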
Fig. 5 is a flowchart of a method for establishing the root recognition model according to one embodiment of the present invention. Specifically, as shown in Fig. 5, the method comprises:
S501, obtain a first training corpus.
In embodiments of the present invention, the first training corpus is the corpus used to train the root recognition model. Specifically, a small number of entity names, for example 1,000, may be sampled from the registered entities and then proofread manually to obtain the first training corpus; this is enough for the trained recognition model to reach a recognition accuracy above 95%. Because so few entity names are needed for the first training corpus, the manual proofreading workload is very small and takes only a few minutes, which greatly saves labor and time while keeping accuracy high.
S502, build a first feature template according to the word features of the first training corpus.
In embodiments of the present invention, for each word in the entity names of the first training corpus, two kinds of features are extracted: the word itself and its part of speech. The two kinds of features of different words in the first training corpus are then combined to obtain a first feature template with a first preset number of feature items.
S503, train the root recognition model according to the first feature template and a conditional random field model.
A conditional random field is a discriminative model that predicts the most probable label sequence by modeling the conditional probability of a label sequence given an observation sequence. Therefore, in embodiments of the present invention, the root recognition model is obtained by applying a conditional random field model with the first feature template, which is built to capture the features of roots.
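For reference, the conditional probability mentioned above is usually written as follows for a linear-chain conditional random field. This is the standard textbook form, given here as an assumption, since the text does not state which variant is used. The symbols are not from the original: x is the observed word sequence, y the label sequence, f_k the feature functions generated from the feature template, and λ_k their learned weights.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \; p(\mathbf{y} \mid \mathbf{x})
```

The predicted label sequence is the one that maximizes this probability, and for a linear chain it is typically found with the Viterbi algorithm.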
Fig. 6 is a flowchart of a method for establishing the entity recognition model according to one embodiment of the present invention. Specifically, as shown in Fig. 6, the method comprises:
S601, obtain a second training corpus according to the root table and the affix table.
In embodiments of the present invention, the second training corpus can be constructed automatically from the root table and the affix table. Specifically, a large number of creative fragments are first segmented and part-of-speech tagged, and regular-expression matching is then performed with the root table and the affix table. The longest "root + affix" match that meets the format requirements (for example, no stop words, no gaps, and a length within a threshold) is taken as an organization name with a suffix. After matching, the results fall into the following four cases:
1. Creative fragments containing "root + affix", for example: Beijing Dawn (root) Andrology Hospital (affix) has senior experts online.
2. Creative fragments containing only a root, for example: Beijing Jundu (root) uses the new five-chamber ion peptide therapy for treatment.
3. Creative fragments containing only an affix, for example: Which hospital (affix) is good for treating prostatitis?
4. Creative fragments containing neither a root nor an affix, for example: No injections, no medication, no surgery, no pain.
Of these four cases, the first two contain an entity and are called positive examples; the last two contain no entity and are called negative examples. Because a creative fragment may or may not contain an entity, the second training corpus used to train the entity recognition model should include both positive and negative examples, otherwise the trained model will be biased. The numbers of positive and negative examples must be in a certain proportion. In one embodiment of the present invention, based on the distribution of creative fragments with and without entities, the ratio of positive to negative examples in the second training corpus is set to 1:3.
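The four cases and the 1:3 sampling can be sketched as below. This is a simplification: it uses plain substring checks instead of segmentation plus regular-expression matching, and the names `fragment_case` and `build_corpus`, the seed and the sample fragments are assumptions.

```python
import random


def fragment_case(fragment, roots, affixes):
    """Return 1-4 according to whether the fragment contains a root and/or an affix."""
    has_root = any(root in fragment for root in roots)
    has_affix = any(affix in fragment for affix in affixes)
    if has_root and has_affix:
        return 1
    if has_root:
        return 2
    if has_affix:
        return 3
    return 4


def build_corpus(fragments, roots, affixes, neg_per_pos=3, seed=0):
    """Keep all positive examples and sample negatives at a 1:neg_per_pos ratio."""
    positives = [f for f in fragments if fragment_case(f, roots, affixes) in (1, 2)]
    negatives = [f for f in fragments if fragment_case(f, roots, affixes) in (3, 4)]
    wanted = min(len(negatives), neg_per_pos * len(positives))
    return positives, random.Random(seed).sample(negatives, wanted)


fragments = [
    "Beijing Dawn Andrology Hospital has experts",
    "Which Hospital is best",
    "No injections, no pain",
    "Call now",
    "Low prices",
    "Open daily",
]
positives, negatives = build_corpus(fragments, {"Dawn"}, {"Hospital"})
print(len(positives), len(negatives))
```

With one positive fragment and five negative candidates, three negatives are kept, which gives the 1:3 ratio described above.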
S602, build a second feature template according to the word features of the second training corpus.
In embodiments of the present invention, for each word in the second training corpus, four kinds of features are extracted: the word itself, its part of speech, its position and its length. The four kinds of features of different words in the second training corpus are then combined to obtain a second feature template with a second preset number of feature items.
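The four kinds of features, combined with those of neighboring words, could be extracted as in the sketch below. The one-word context window, the feature names, the function `token_features` and the part-of-speech tags are assumptions; the text does not specify how features of different words are combined.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i], where tokens is a list of (word, pos_tag)."""
    word, pos_tag = tokens[i]
    features = {
        "word": word,          # the word itself
        "pos": pos_tag,        # its part of speech
        "position": i,         # its position in the fragment
        "length": len(word),   # its length
    }
    # Combine with the neighboring words (hypothetical window of one word each side).
    if i > 0:
        features["prev_word"], features["prev_pos"] = tokens[i - 1]
    if i + 1 < len(tokens):
        features["next_word"], features["next_pos"] = tokens[i + 1]
    return features


tokens = [("Beijing", "ns"), ("Dawn", "n"), ("Hospital", "n")]
print(token_features(tokens, 1))
```

Dictionaries of this shape are a common input format for conditional random field toolkits, which expand them into the feature items that the template describes.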
S603, train the entity recognition model according to the second feature template and a conditional random field model.
A conditional random field is a discriminative model that predicts the most probable label sequence by modeling the conditional probability of a label sequence given an observation sequence. Therefore, in embodiments of the present invention, the entity recognition model is obtained by applying a conditional random field model with the second feature template, which is built to capture the features of entity names.
As the embodiments shown in Figs. 4, 5 and 6 illustrate, in the method of the embodiments of the present invention, corpus construction, recognition model training, and the establishment of the root table and the affix table can be performed almost entirely automatically. Although manual proofreading is needed when obtaining the first training corpus for the root recognition model, the labor and time required are very small and the dependence on manual work is extremely low, which greatly reduces the consumption of human and material resources and saves time.
In order to realize above-described embodiment, the present invention also proposes a kind of identification device of physical name.
A kind of identification device of physical name, including:Acquisition module, for obtaining text to be identified and text to be identified Source-information;First identification module, obtained for the source-information according to text to be identified and identification model in text to be identified First instance name;Second identification module, for according to the root chart that pre-establishes and default constraint rule from text to be identified In non-first instance name content in obtain second instance name.
Fig. 7 is a schematic structural diagram of a device for recognizing entity names according to one embodiment of the present invention.
As shown in Fig. 7, the device for recognizing entity names according to an embodiment of the present invention includes an acquisition module 10, a first identification module 20 and a second identification module 30.
Specifically, the acquisition module 10 is configured to obtain the text to be identified and the source information of the text to be identified. In one embodiment of the present invention, the source information of the text to be identified is the name of the company, website, or the like that publishes the text, for example "Shenzhen Lianxunda Electronic Technology Development Co., Ltd.".
In an embodiment of the present invention, the text to be identified is natural language text. The source information may be provided by the user together with the text to be identified, or it may be obtained from the publication information recorded when the text is published, such as the publisher's account information. This works because a publisher's account information usually includes the organization that the publisher belongs to or represents.
The first identification module 20 is configured to obtain the first entity name in the text to be identified according to the source information of the text and the identification model. In an embodiment of the present invention, the first entity name is an entity name related to the source information of the text to be identified. For example, in one embodiment of the present invention, the first entity name may be an organization name: if the source information of the text to be identified is "Shenzhen Lianxunda Electronic Technology Development Co., Ltd.", the first entity name may be "Lianxunda Electronic Technology Development Co., Ltd.".
More specifically, in one embodiment of the present invention, the first identification module 20 is configured to identify the source information of the text to be identified with the root identification model, so as to obtain the root in the source information, and to obtain the first entity name in the text to be identified according to the root and a pre-established affix table.
In an embodiment of the present invention, the root identification model is established in advance. It may be trained before the text to be identified is processed, or a trained root identification model may be copied or downloaded from another storage device. The root identification model is trained according to the root table and can identify the root in the source information of the text to be identified. For example, for the source information "Shenzhen Lianxunda Electronic Technology Development Co., Ltd.", the root identification model identifies the root "Lianxunda". In an embodiment of the present invention, the affix table is a table that stores the suffixes of multiple first entity names. For example, the affix table may include entity name suffixes such as "Co., Ltd.", "electromechanical accessory factory" and "limited liability company".
In one embodiment of the present invention, the first entity name may be an entity name with a suffix, such as "Lianxunda Co., Ltd.", or an entity name without a suffix, such as "Lianxunda". Therefore, the first identification module 20 may first search for the root in the text to be identified; if the root is present, it is a first entity name in the text. The first identification module 20 may then search the text, according to the root and the affix table, for character strings formed by combining the root with any affix in the affix table, and take these strings as first entity names as well.
In another embodiment of the present invention, many entities have aliases, so the root obtained from the source information may not cover every entity name in the text to be identified; for example, "Fanke" (rendered literally as "all visitors") may also be written as "VANCL". To identify the entity names in the text comprehensively, the first identification module 20 may also be configured to identify the text to be identified with an entity recognition model, so as to obtain first entity names in the text. The entity recognition model is established in advance: it may be trained before the text to be identified is processed, or a trained entity recognition model may be copied or downloaded from another storage device. The entity recognition model is trained according to the root table and the affix table and can identify entities in the text to be identified. For example, "VANCL Chengpin" in the text to be identified can be identified as a first entity name by the entity recognition model.
The second identification module 30 is configured to obtain a second entity name from the content of the text to be identified other than the first entity name, according to the pre-established root table and the preset constraint rules. In an embodiment of the present invention, the second entity name is an entity name related to what the first entity acts as an agent for, produces, or deals in. For example, if the first entity name is an organization name, the second entity name may be a brand name.
More specifically, the second identification module 30 is configured to search the pre-established root table for the roots contained in the content of the text to be identified other than the first entity name, and to screen those roots according to the preset constraint rules, so as to obtain the second entity name from that content. In one embodiment of the present invention, the roots in the root table can be divided into strongly constrained roots and weakly constrained roots. A strongly constrained root can serve as an entity name in any context, whereas a weakly constrained root can serve as an entity name only when certain context constraints are met. For example, "Fanke" (rendered literally as "all visitors") is a strongly constrained root, while "seven days" can serve as an entity name only when combined with an affix such as "hotel" or "holiday inn"; in other cases "seven days" is merely a quantity expression. Preset constraint rules therefore need to be established for weakly constrained roots. A preset constraint rule restricts the context of a weakly constrained root so that the root is accepted as an entity name only when the rule is satisfied. Because weakly constrained roots differ in type, the preset constraint rules are matched to the individual weakly constrained roots, and the present invention does not limit the specific form of the preset constraint rules.
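The distinction between strongly and weakly constrained roots can be sketched as below. The patent leaves the form of the preset constraint rules open, so modelling a rule as a callable that inspects the text after the root is an assumption, as are all function names; the example roots follow the "VANCL" and "seven days" examples in the description.

```python
from typing import Callable, Dict, Iterable, List, Set

# A context rule receives (content, start index of the root, root) and decides
# whether the root may serve as an entity name at that position.
ContextRule = Callable[[str, int, str], bool]


def followed_by_any(affixes: Iterable[str]) -> ContextRule:
    """Hypothetical rule: the root must be directly followed by one of the affixes."""
    affix_list = list(affixes)

    def rule(content: str, start: int, root: str) -> bool:
        rest = content[start + len(root):].lstrip()
        return any(rest.startswith(affix) for affix in affix_list)

    return rule


def second_entity_roots(content: str, strong_roots: Set[str],
                        weak_rules: Dict[str, ContextRule]) -> List[str]:
    """Screen the roots found in content (text with first entity names excluded)."""
    # Strongly constrained roots are accepted wherever they occur.
    found = {root for root in strong_roots if root in content}
    # Weakly constrained roots are accepted only if their rule holds somewhere.
    for root, rule in weak_rules.items():
        start = content.find(root)
        while start != -1:
            if rule(content, start, root):
                found.add(root)
                break
            start = content.find(root, start + 1)
    return sorted(found)


strong = {"VANCL"}
weak = {"seven days": followed_by_any(["hotel", "holiday inn"])}
```

With these tables, "seven days" is kept in a fragment such as "book a seven days hotel" but rejected in "delivered within seven days", where it is only a quantity expression.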
The device for recognizing entity names of the embodiment of the present invention obtains the first entity name in the text to be identified according to the source information of the text and the identification model, and obtains the second entity name in the text according to the root table and the preset constraint rules. It thus combines the advantages of statistical learning methods and rule-based recognition methods, improves the precision and recall of entity name recognition, is applicable to various language forms, and is highly general. In addition, effective recognition of entity names in ad creative text greatly satisfies the need for personalized creatives and the need to identify vocabulary that carries legal risk.
In one embodiment of the invention, after physical name is identified, stamped according to the type of the physical name identified Corresponding label.For example, the label of mechanism name is<ORG></ORG>, the label of brand name is<BRD></BRD>.For example, " if even news reach electronic technology development corporation, Ltd. for Shenzhen " is an exabyte, but the physical name in the intention of its issue Label it is as follows:
Creative: …<BRD>Nexans</BRD> network cable - first choice Shenzhen <ORG>Lianxunda</ORG>…
Here "Lianxunda" is an organization name, while "Nexans" is the name of a product that the company deals in and should be identified as a brand name.
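The labeling step can be sketched as a single pass over the text. The function name and the placeholder entity names below are hypothetical; only the <ORG> and <BRD> tag names come from the description.

```python
import re
from typing import Dict


def tag_entities(text: str, entities: Dict[str, str]) -> str:
    """Wrap each recognised entity name in its type tag, e.g. <ORG>...</ORG>.

    `entities` maps an entity name to its tag name ("ORG" or "BRD").
    """
    if not entities:
        return text
    # Longer names come first so that a name containing a shorter one wins.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match: "re.Match[str]") -> str:
        name = match.group(0)
        tag = entities[name]
        return f"<{tag}>{name}</{tag}>"

    return pattern.sub(wrap, text)


# Placeholder names, chosen only for illustration.
tagged = tag_entities("ExampleBrand network cable, first choice ExampleCo",
                      {"ExampleBrand": "BRD", "ExampleCo": "ORG"})
```

Doing the replacement in one regular-expression pass avoids tagging a short name again inside a longer name that has already been wrapped.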
Fig. 8 is a schematic structural diagram of a device for recognizing entity names according to another embodiment of the present invention. As shown in Fig. 8, the device includes an acquisition module 10, a first identification module 20, a second identification module 30, a vocabulary establishing module 40, a first model training module 50 and a second model training module 60.
Specifically, the vocabulary establishing module 40 is configured to:
Collect multiple registered entity names, where a registered entity name is an entity name that has already been established, such as a registered company name, a product name, or a registered brand;
Segment each of the registered entity names to obtain multiple segmented words, where any word segmentation method in the related art or that may appear in the future can be used, and the present invention does not limit the segmentation method;
Obtain attribute features of the segmented words, where the attribute features of a segmented word include its part of speech, its length, the frequency with which it occurs across all registered entity names, and its position within the registered entity name;
Screen out, from the segmented words and according to the attribute features, the multiple roots of the root table and the multiple affixes of the affix table, so as to establish the root table and the affix table.
In an embodiment of the present invention, roots have attribute features such as a high frequency of occurrence and a typical position between the region word and the product word, while affixes have attribute features such as a high frequency and a typical position at the tail of the company name. Roots and affixes can therefore be screened out from the segmented words by the attribute features that each of them has.
For example, roots can be screened out from the segmented words by the following rules:
A. the characters that form the word must not be separated by other words;
B. the word is not a region word;
C. the product of the word's frequency and its position value must satisfy a certain threshold;
D. the total length of the word must be less than a certain length threshold.
Affixes can be screened out from the segmented words by the following rules:
A. the word is at the tail of the company name (or at a tail within a recursive structure);
B. the frequency of occurrence of the word must be greater than a certain frequency threshold;
C. the characters that form the word must satisfy certain part-of-speech restrictions.
It should be understood that the above rules are only exemplary. In other embodiments of the present invention, those skilled in the art may set the screening rules for roots and affixes according to other attribute features of roots and affixes that are not listed in the foregoing description.
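The screening rules above can be sketched as follows. Several details are assumptions, because the patent does not define them: the "frequency * position" score is interpreted as the number of times a word directly follows a region word, the part-of-speech tags ("ns" for region words, "n" for common nouns) follow a common Chinese tagging convention, and all thresholds and names are illustrative. Root rule A (contiguity) is taken to be guaranteed by the segmenter and is not checked.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (segmented word, part-of-speech tag)


def build_tables(names: Iterable[List[Token]],
                 region_pos: str = "ns",
                 affix_pos: FrozenSet[str] = frozenset({"n"}),
                 root_score_threshold: int = 2,
                 root_length_threshold: int = 10,
                 affix_freq_threshold: int = 1) -> Tuple[Set[str], Set[str]]:
    """Screen roots and affixes out of segmented registered entity names."""
    after_region: Counter = Counter()  # assumed frequency * position score
    tail_freq: Counter = Counter()     # how often a word ends a name
    pos_of: Dict[str, str] = {}
    for tokens in names:
        for i, (word, pos) in enumerate(tokens):
            pos_of[word] = pos
            if i > 0 and tokens[i - 1][1] == region_pos:
                after_region[word] += 1
            if i == len(tokens) - 1:
                tail_freq[word] += 1
    roots = {
        word for word, score in after_region.items()
        if pos_of[word] != region_pos            # root rule B: not a region word
        and score >= root_score_threshold        # root rule C: score threshold
        and len(word) < root_length_threshold    # root rule D: length limit
    }
    affixes = {
        word for word, count in tail_freq.items()  # affix rule A: tail position
        if count > affix_freq_threshold            # affix rule B: frequency
        and pos_of[word] in affix_pos              # affix rule C: part of speech
    }
    return roots, affixes


registered = [
    [("Shenzhen", "ns"), ("Lianxunda", "nz"), ("Electronics", "n"), ("Co., Ltd.", "n")],
    [("Shanghai", "ns"), ("Lianxunda", "nz"), ("Trading", "n"), ("Co., Ltd.", "n")],
    [("Beijing", "ns"), ("Dawn", "nz"), ("Hospital", "n")],
    [("Guangzhou", "ns"), ("Dawn", "nz"), ("Hospital", "n")],
]
roots, affixes = build_tables(registered)
```

On this toy input the words that repeatedly follow a region word become roots, and the words that repeatedly close a name become affixes.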
In one embodiment of the invention, due to the wide variety of physical name, therefore, the data volume of root chart is very huge Greatly, in order to improve inquiry velocity when using root chart, compressed index is established to root chart, for example, for identical The root of prefix, a common index can be established according to their identical prefixes, so as to improve search efficiency.In addition, such as Previous embodiment, root is divided into the root of strong constraint and the root of weak constraint, and therefore, root chart can distinguish strong root chart and weak Root chart.
The first model training module 50 is configured to:
Obtain a first training corpus, where the first training corpus is the corpus used to train the root identification model. Specifically, a small number of entity names, for example 1000, can be extracted from the established entity names, and the first training corpus is obtained by manually proofreading the 1000 extracted entity names; this is enough for the trained identification model to reach a recognition accuracy above 95%. Because so few entity names are needed to obtain the first training corpus, the manual proofreading workload is very small and takes only a few minutes, which greatly saves manpower and time while keeping accuracy high;
Build a first feature template according to the word features of the first training corpus, where, for each word in the entity names of the first training corpus, two types of features are extracted, namely the word itself and its part of speech, and the two types of features of different words in the first training corpus are then combined to obtain a first feature template with a first preset number of feature items;
Train the root identification model according to the first feature template and a conditional random field model, where the conditional random field model is a discriminative model that predicts the most probable label sequence by defining the conditional probability of a label sequence given an observation sequence. Therefore, in an embodiment of the present invention, the conditional random field model is used to obtain the root identification model from the constructed first feature template, which captures the features of roots.
The second model training module 60 is configured to:
Obtain a second training corpus according to the root table and the affix table, where the results obtained after matching is finished fall into the following four cases:
1. creative fragments that contain "root + affix", for example: Beijing Dawn (root) Andrology Hospital (affix) has senior experts online.
2. creative fragments that contain only a "root", for example: Beijing Jundu (root) adopts the new five-chamber ion peptide therapy for treatment.
3. creative fragments that contain only an "affix", for example: Which hospital (affix) is good for treating prostatitis?
4. creative fragments that contain neither a root nor an affix, for example: No injections, no oral medication, no surgery, no pain.
Of these four cases, the first two contain an entity and are called "positive examples", while the last two contain no entity and are called "negative examples". Because a creative fragment may or may not contain an entity, the second training corpus used to train the entity recognition model should contain both positive and negative examples; otherwise the trained model will be biased. The numbers of positive and negative examples need to be in a certain proportion. In one embodiment of the present invention, according to the distribution of creative fragments that do and do not contain entities, the ratio of positive to negative examples in the second training corpus can be set to 1:3;
Build a second feature template according to the word features of the second training corpus, where, for each word in the second training corpus, four types of features are extracted, namely the word itself, its part of speech, its position, and its length, and the four types of features of different words in the second training corpus are then combined to obtain a second feature template with a second preset number of feature items;
Train the entity recognition model according to the second feature template and a conditional random field model, where the conditional random field model is a discriminative model that predicts the most probable label sequence by defining the conditional probability of a label sequence given an observation sequence. Therefore, in an embodiment of the present invention, the conditional random field model is used to obtain the entity recognition model from the constructed second feature template, which captures the features of entity names.
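The construction of the second training corpus described above (four matching cases, then positive and negative examples at a 1:3 ratio) can be sketched as follows. Substring matching against the two tables, random sampling of the negative examples, and all names are assumptions; the patent only fixes the four cases and the example ratio.

```python
import random
from typing import Iterable, List, Set, Tuple


def classify_fragment(fragment: str, roots: Set[str], affixes: Set[str]) -> str:
    """Assign a creative fragment to one of the four matching cases."""
    has_root = any(root in fragment for root in roots)
    has_affix = any(affix in fragment for affix in affixes)
    if has_root and has_affix:
        return "root+affix"
    if has_root:
        return "root"
    if has_affix:
        return "affix"
    return "none"


def build_second_corpus(fragments: Iterable[str], roots: Set[str], affixes: Set[str],
                        negatives_per_positive: int = 3,
                        seed: int = 0) -> Tuple[List[str], List[str]]:
    """Return (positive examples, sampled negative examples) at roughly 1:3."""
    positives: List[str] = []
    negatives: List[str] = []
    for fragment in fragments:
        case = classify_fragment(fragment, roots, affixes)
        # Cases 1 and 2 contain an entity; cases 3 and 4 do not.
        (positives if case in ("root+affix", "root") else negatives).append(fragment)
    wanted = min(len(negatives), negatives_per_positive * len(positives))
    sampled = random.Random(seed).sample(negatives, wanted)
    return positives, sampled


fragments = [
    "Beijing Dawn Andrology Hospital has senior experts online",
    "Beijing Dawn adopts a new therapy",
    "Which Hospital is good for treating prostatitis",
    "No injections, no oral medication, no surgery, no pain",
    "Free consultation today",
    "Quick recovery guaranteed",
    "Open all year round",
    "Book an appointment online",
    "Ask an expert now",
]
pos_examples, neg_examples = build_second_corpus(fragments, {"Dawn"}, {"Hospital"})
```

Fixing the random seed keeps the sampled negative examples reproducible between training runs.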
With the device for recognizing entity names of the embodiment of the present invention, training on the corpus, training of the recognition models, and establishment of the root table and the affix table can be performed almost entirely automatically. Manual proofreading is needed only when obtaining the first training corpus used to train the root identification model, and the manpower and time it requires are very small. Dependence on manual work is therefore extremely low, which greatly reduces the consumption of human and material resources, saves time, and keeps accuracy high.
Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, which may be regarded, for example, as an ordered list of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", or the like means that specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims (12)

  1. A method for recognizing entity names, characterized by comprising:
    obtaining a text to be identified and source information of the text to be identified;
    obtaining a first entity name in the text to be identified according to the source information of the text to be identified and an identification model;
    obtaining a second entity name from the content of the text to be identified other than the first entity name, according to a pre-established root table and preset constraint rules;
    wherein obtaining the first entity name in the text to be identified according to the source information of the text to be identified and the identification model specifically comprises:
    identifying the source information of the text to be identified according to a root identification model, to obtain a root in the source information of the text to be identified;
    obtaining the first entity name in the text to be identified according to the root and a pre-established affix table;
    wherein obtaining the second entity name from the content of the text to be identified other than the first entity name, according to the pre-established root table and the preset constraint rules, specifically comprises:
    searching, according to the pre-established root table, for the roots contained in the content of the text to be identified other than the first entity name;
    screening the roots contained in the content of the text to be identified other than the first entity name;
    if a root contained in the content of the text to be identified other than the first entity name is a strongly constrained root, obtaining the second entity name directly, wherein a strongly constrained root is a root that can serve as an entity name in any context;
    if a root contained in the content of the text to be identified other than the first entity name is a weakly constrained root, obtaining the second entity name according to the preset constraint rules, wherein a weakly constrained root is a root that can serve as an entity name only when certain context constraints are met.
  2. The method according to claim 1, characterized in that:
    the first entity name is an organization name;
    the second entity name is a brand name.
  3. The method according to claim 1, characterized by further comprising:
    identifying the text to be identified according to an entity recognition model, to obtain the first entity name in the text to be identified.
  4. The method according to claim 1, characterized in that, before obtaining the text to be identified and the source information of the text to be identified, the method further comprises:
    collecting multiple registered entity names;
    segmenting each of the multiple registered entity names to obtain multiple segmented words;
    obtaining attribute features of the multiple segmented words;
    screening out, from the multiple segmented words and according to the attribute features, multiple roots of the root table and multiple affixes of the affix table, to establish the root table and the affix table.
  5. The method according to claim 1, characterized by further comprising:
    obtaining a first training corpus;
    building a first feature template according to the word features of the first training corpus;
    training the root identification model according to the first feature template and a conditional random field model.
  6. The method according to claim 3, characterized by further comprising:
    obtaining a second training corpus according to the root table and the affix table;
    building a second feature template according to the word features of the second training corpus;
    training the entity recognition model according to the second feature template and a conditional random field model.
  7. A device for recognizing entity names, characterized by comprising:
    an acquisition module, configured to obtain a text to be identified and source information of the text to be identified;
    a first identification module, configured to obtain a first entity name in the text to be identified according to the source information of the text to be identified and an identification model;
    a second identification module, configured to obtain a second entity name from the content of the text to be identified other than the first entity name, according to a pre-established root table and preset constraint rules;
    wherein the first identification module is specifically configured to:
    identify the source information of the text to be identified according to a root identification model, to obtain a root in the source information of the text to be identified;
    obtain the first entity name in the text to be identified according to the root and a pre-established affix table;
    wherein the second identification module is specifically configured to:
    search, according to the pre-established root table, for the roots contained in the content of the text to be identified other than the first entity name;
    screen the roots contained in the content of the text to be identified other than the first entity name;
    if a root contained in the content of the text to be identified other than the first entity name is a strongly constrained root, obtain the second entity name directly, wherein a strongly constrained root is a root that can serve as an entity name in any context;
    if a root contained in the content of the text to be identified other than the first entity name is a weakly constrained root, obtain the second entity name according to the preset constraint rules, wherein a weakly constrained root is a root that can serve as an entity name only when certain context constraints are met.
  8. The device according to claim 7, characterized in that:
    the first entity name is an organization name;
    the second entity name is a brand name.
  9. The device according to claim 7, characterized in that the first identification module is further configured to identify the text to be identified according to an entity recognition model, to obtain the first entity name in the text to be identified.
  10. The device according to claim 7, characterized by further comprising a vocabulary establishing module, the vocabulary establishing module being configured to:
    collect multiple registered entity names;
    segment each of the multiple registered entity names to obtain multiple segmented words;
    obtain attribute features of the multiple segmented words;
    screen out, from the multiple segmented words and according to the attribute features, multiple roots of the root table and multiple affixes of the affix table, to establish the root table and the affix table.
  11. The device according to claim 7, characterized by further comprising a first model training module, the first model training module being configured to:
    obtain a first training corpus;
    build a first feature template according to the word features of the first training corpus;
    train the root identification model according to the first feature template and a conditional random field model.
  12. The device according to claim 9, characterized by further comprising a second model training module, the second model training module being configured to:
    obtain a second training corpus according to the root table and the affix table;
    build a second feature template according to the word features of the second training corpus;
    train the entity recognition model according to the second feature template and a conditional random field model.
CN201410234622.0A 2014-05-29 2014-05-29 The recognition methods of physical name and device Active CN103995885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410234622.0A CN103995885B (en) 2014-05-29 2014-05-29 The recognition methods of physical name and device

Publications (2)

Publication Number Publication Date
CN103995885A CN103995885A (en) 2014-08-20
CN103995885B true CN103995885B (en) 2017-11-17

Family

ID=51310050

Country Status (1)

Country Link
CN (1) CN103995885B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550372A (en) * 2016-01-28 2016-05-04 浪潮软件集团有限公司 Sentence training device and method and information extraction system
CN106503192B (en) * 2016-10-31 2019-10-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN108241621B (en) * 2016-12-23 2019-12-10 北京国双科技有限公司 legal knowledge retrieval method and device
CN107943786B (en) * 2017-11-16 2021-12-07 广州市万隆证券咨询顾问有限公司 Chinese named entity recognition method and system
CN108595430B (en) * 2018-04-26 2022-02-22 携程旅游网络技术(上海)有限公司 Aviation transformer information extraction method and system
CN108829681B (en) * 2018-06-28 2022-11-11 鼎富智能科技有限公司 Named entity extraction method and device
CN111178073B (en) * 2018-10-23 2024-06-04 北京嘀嘀无限科技发展有限公司 Text processing method, device, electronic equipment and storage medium
CN109582975B (en) * 2019-01-31 2023-05-23 北京嘉和海森健康科技有限公司 Named entity identification method and device
CN110750991B (en) * 2019-09-18 2022-04-15 平安科技(深圳)有限公司 Entity identification method, device, equipment and computer readable storage medium
CN110688841A (en) * 2019-09-30 2020-01-14 广州准星信息科技有限公司 Mechanism name identification method, mechanism name identification device, mechanism name identification equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901235A (en) * 2009-05-27 2010-12-01 国际商业机器公司 Method and system for document processing
CN102103594A (en) * 2009-12-22 2011-06-22 北京大学 Character data recognition and processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514165A (en) * 2012-06-15 2014-01-15 佳能株式会社 Method and device for identifying persons mentioned in conversation

Also Published As

Publication number Publication date
CN103995885A (en) 2014-08-20

Similar Documents

Publication Publication Date Title
CN103995885B (en) Method and device for recognizing entity names
CN106649783B (en) Synonym mining method and device
CN104679850B (en) Address structure method and device
CN107609052A (en) Method and device for generating a domain knowledge graph based on the semantic triangle
CN102253930B (en) Method and device for text translation
CN104281702B (en) Data retrieval method and device based on electric power keyword segmentation
CN104156352A (en) Method and system for handling Chinese events
CN104679867B (en) Graph-based address knowledge processing method and device
CN105095190B (en) Sentiment analysis method based on the combination of Chinese semantic structure and a subdivided dictionary
CN104809142A (en) Trademark inquiring system and method
CN101178705A (en) Free-running speech comprehension method and man-machine interactive intelligent system
CN109145047A (en) Configuration method, data processing device and storage medium for user tag portraits
CN103324626A (en) Method for setting multi-granularity dictionary and segmenting words and device thereof
CN101452443B (en) Recording medium for recording logical structure model creation assistance program, logical structure model creation assistance device and logical structure model creation assistance method
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN103927299A (en) Method for providing candidate sentences in input method and method and device for recommending input content
US7853595B2 (en) Method and apparatus for creating a tool for generating an index for a document
CN102122280A (en) Method and system for intelligently extracting content object
CN108647199A (en) Method for discovering new place-name words
CN109408806A (en) Event extraction method based on English grammar rules
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN110929007A (en) Electric power marketing knowledge system platform and application method
CN108304424A (en) Text keyword extraction method and text keyword extraction device
CN103440343B (en) Knowledge base construction method oriented to domain service goals
Chuang et al. Context-aware wrapping: Synchronized data extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant