CN109902303A - An entity recognition method and related device - Google Patents

An entity recognition method and related device

Info

Publication number
CN109902303A
Authority
CN
China
Prior art keywords
mark
entity
corpus
path
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910158600.3A
Other languages
Chinese (zh)
Other versions
CN109902303B (en)
Inventor
林浚玮
邵轶男
王巨宏
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Tencent Technology Shenzhen Co Ltd
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co Ltd and Shenzhen Graduate School, Harbin Institute of Technology
Priority to CN201910158600.3A
Publication of CN109902303A
Application granted
Publication of CN109902303B
Active legal status
Anticipated expiration legal status

Landscapes

  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses an entity recognition method and related device. A plurality of annotated corpora are first obtained, each of which carries annotation information. A hypergraph model is then established according to a preset entity annotation rule. Next, the annotation path graph corresponding to each annotated corpus is determined according to the annotation information and the entity annotation rule, and a model to be trained is established according to the hypergraph model and a preset neural network model. Finally, the annotation path graphs are input into the model to be trained to obtain an entity recognition model, and at least one named entity in an input corpus is identified according to the entity recognition model. With the embodiments of the present invention, entities with nested structure can be effectively recognized, thereby improving the accuracy of entity recognition and entity extraction.

Description

An entity recognition method and related device
Technical field
The present invention relates to the technical field of information processing, and more particularly to an entity recognition method and related device.
Background technique
In the era of information explosion, how to quickly and effectively extract the required information from massive data has become a hot research topic, giving rise to research on natural language processing. Entity extraction has long received extensive attention in the natural language processing field: it is the preliminary step of many natural language processing tasks, so its performance directly affects the performance of downstream natural language processing tasks such as entity linking, entity relation classification, and knowledge graph reasoning. Here, an entity refers to a named entity, i.e., a person name, organization name, place name, or any other entity designated by a name in natural language; in a broader sense, entities may also include numbers, dates, currencies, addresses, and so on. In the entity extraction task, entity overlapping and entity nesting can occur. As shown on the left side of Fig. 1, the character string X1X2X3 is labeled as a person entity (PER) and the character string X2X3X4 is labeled as a place entity (GPE); the two overlap (at X2X3). As shown on the right side of Fig. 1, the character string X1X2 is labeled PER and the character string X1X2X3X4 is labeled GPE; X1X2 is a substring of X1X2X3X4, which is a nested structure. At present, the mainstream extraction models for the entity extraction task are the conditional random field (CRF) model and the neural network-CRF model. Such models cannot handle nested structure directly and can only recognize nested entities by stacking multiple models; but because each CRF in the stack is independent of the others, the dependencies between entities cannot be effectively captured, which leads to poor entity recognition performance and low entity extraction accuracy.
Summary of the invention
The present invention provides an entity recognition method and related device that can effectively recognize entities with nested structure, thereby improving the accuracy of entity recognition and entity extraction.
In a first aspect, an embodiment of the present invention provides an entity recognition method, comprising:
obtaining a plurality of annotated corpora, each of which carries annotation information;
establishing a hypergraph model according to a preset entity annotation rule;
determining, according to the annotation information and the entity annotation rule, the annotation path graph corresponding to each annotated corpus;
establishing a model to be trained according to the hypergraph model and a preset neural network model;
inputting the annotation path graphs into the model to be trained to obtain an entity recognition model;
identifying, according to the entity recognition model, at least one named entity in an input corpus.
In a second aspect, an embodiment of the present invention provides an entity recognition device, comprising:
an obtaining module, configured to obtain a plurality of annotated corpora, each of which carries annotation information;
a modeling module, configured to establish a hypergraph model according to a preset entity annotation rule;
an annotation module, configured to determine, according to the annotation information and the entity annotation rule, the annotation path graph corresponding to each annotated corpus;
the modeling module being further configured to establish a model to be trained according to the hypergraph model and a preset neural network model;
a training module, configured to input the annotation path graphs into the model to be trained to obtain an entity recognition model;
an identification module, configured to identify, according to the entity recognition model, at least one named entity in an input corpus.
In a third aspect, an embodiment of the present invention provides entity recognition equipment, comprising a processor, a memory, and a communication bus, wherein the communication bus realizes the connection and communication between the processor and the memory, and the processor executes a program stored in the memory to realize the steps of the entity recognition method provided in the first aspect.
In a possible design, the entity recognition equipment provided by the present invention may include modules corresponding to the actions in the above method. The modules may be software and/or hardware.
Another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to execute the methods described in the above aspects.
Another aspect of the embodiments of the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
By implementing the embodiments of the present invention, a plurality of annotated corpora are first obtained, each of which carries annotation information; a hypergraph model is then established according to a preset entity annotation rule; next, the annotation path graph corresponding to each annotated corpus is determined according to the annotation information and the entity annotation rule, and a model to be trained is established according to the hypergraph model and a preset neural network model; finally, the annotation path graphs are input into the model to be trained to obtain an entity recognition model, and at least one named entity in an input corpus is identified according to the entity recognition model. Entities with nested structure can thus be effectively recognized, improving the accuracy of entity recognition and entity extraction.
Detailed description of the invention
To illustrate the technical solutions in the embodiments of the present invention or in the background more clearly, the drawings needed in the embodiments of the present invention or in the background are described below.
Fig. 1 is a schematic diagram of the nested entities and overlapping entities described in the background;
Fig. 2 is a schematic structural diagram of an information extraction system provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of an entity recognition method provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a fully connected hypergraph model provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an annotation path graph provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a model to be trained provided by an embodiment of the present invention;
Fig. 7 is a schematic flowchart of another entity recognition method provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an entity recognition device provided by an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of entity recognition equipment provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of an information extraction system provided by an embodiment of the present invention. The information extraction system includes information processing equipment, a database, and other equipment. The information processing equipment may be a computer, a mobile phone, or a server (such as a database server or a file server). The database stores a large amount of voice information, text information, and the like; it may be a local database of the information processing equipment, or another database that the information processing equipment is allowed to access. The information processing equipment can obtain information from the database and can also receive information sent by other equipment, and performs entity extraction on the obtained or received information, so as to execute subsequent information processing tasks (such as knowledge graph reasoning, entity linking, and entity relation classification) or push information to other equipment, where the other equipment may likewise be mobile phones, computers, servers, and so on. The information processing equipment may first identify the named entities in the obtained or received information and then extract the required named entities from the various named entities identified. Based on the above system, the embodiments of the present invention provide the following entity recognition methods.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of an entity recognition method provided by an embodiment of the present invention. The method includes, but is not limited to, the following steps:
S301: obtain a plurality of annotated corpora, each of which carries annotation information.
In specific implementation, an annotated corpus may be a sentence of any length, such as "He talked to the U.S.president" or "Zhang San is an office worker of company XXX". The annotation information of each annotated corpus includes the annotation label of each word in that corpus. According to the annotation labels, it can be determined whether the corpus contains named entities and, if so, of which types; for brevity of description, named entities are simply called entities below. Entity types may include, but are not limited to, person entities and place entities, and annotation labels may be numbers, characters, or character strings; for example, the annotation label of a person entity may be, but is not limited to, PER, and the annotation label of a place entity may be, but is not limited to, GPE. For example, the annotated corpus "He talked to the U.S.president" is annotated as follows:
"He" is a person entity, and its annotation label is PER; "talked" and "to" are neither person entities nor place entities, and their annotation label is "O"; "the U.S.president" is a person entity, and its annotation label is PER; within it, "U.S." is a place entity, and its annotation label is "GPE".
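The annotation above can be sketched as span-based labels over token positions, which keeps nested and overlapping entities as separate spans instead of forcing one label per token. This is a minimal illustration, not the patent's data format; the tokenization and helper name are assumptions.

```python
# Illustrative tokenization of the example sentence from the text.
tokens = ["He", "talked", "to", "the", "U.S.", "president"]

# Each entity is (start, end_exclusive, label) over token indices;
# "the U.S. president" (PER) contains the nested entity "U.S." (GPE).
entities = [
    (0, 1, "PER"),  # "He"
    (3, 6, "PER"),  # "the U.S. president"
    (4, 5, "GPE"),  # "U.S."
]

def token_labels(tokens, entities):
    """Per-token label sets; tokens inside nested entities carry several labels."""
    labels = [set() for _ in tokens]
    for start, end, tag in entities:
        for i in range(start, end):
            labels[i].add(tag)
    return [lab or {"O"} for lab in labels]

print(token_labels(tokens, entities))
```

Note that the token "U.S." ends up with two labels, {"PER", "GPE"}, which a single flat label sequence cannot express; this is precisely the nesting problem the hypergraph model addresses.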
S302: establish a hypergraph model according to a preset entity annotation rule.
In specific implementation, a hypergraph model is a generalization of the traditional graph model: each edge in a hypergraph model can connect two or more nodes, and an edge connecting more than two nodes is commonly called a hyperedge. By contrast, each edge in a traditional graph model can connect at most two nodes, and in the traditional conditional random field graph model each edge connects only one node.
In order to annotate nested entities, an embodiment of the present invention proposes a special hypergraph model. The hypergraph model contains multiple parent nodes, and each parent node corresponds to child nodes of multiple types. Each parent node corresponds to the input word on one time step: for a sentence containing N words, one word can be input on each time step, and the word input on a time step is the input word on that time step; a time step can be regarded as a period of a preset length, for example 0.1 millisecond. Each parent node may correspond to, but is not limited to, child nodes of seven types. In the embodiments of the present invention, A_k denotes the parent node corresponding to the k-th time step; then:
A_k denotes an entity starting at position k or a later position, where position k is understood, here and below, as the position in the corpus of the input word on the k-th time step;
the first type of child node of A_k, E_k, denotes an entity whose left boundary is at position k;
the second type of child node of A_k, T_k^j, denotes an entity whose left boundary is at position k and whose type is j;
the third type of child node of A_k, B_k^j, denotes an entity of type j beginning at position k;
the fourth type of child node of A_k, I_k^j, denotes an entity of type j covering position k;
the fifth type of child node of A_k, L_k^j, denotes an entity of type j ending at position k;
the sixth type of child node of A_k, U_k^j, denotes an entity of type j with unit length at position k, where unit length means that the beginning and end of the entity are at the same position;
the seventh type of child node of A_k, X, denotes the end of an entity.
First, the length of each annotated corpus obtained may be counted, where the length of an annotated corpus equals the number of words it contains. For example, "He talked to the U.S.president" contains 6 words, so its length is 6; "Zhang San is an office worker of company XXX" contains 7 words, so its length is 7. The number of parent nodes in the hypergraph model, i.e. the temporal extent of the hypergraph model, may then be determined by, but is not limited to being determined by, the maximum length; for example, if the maximum length is 10, the number of parent nodes is determined to be 10. The child nodes of the various types are then connected, and the multiple parent nodes are connected, according to the preset entity annotation rule. The multiple parent nodes include a first parent node and a second parent node, which are parent nodes on adjacent time steps: if the first parent node is A_k, then the second parent node is A_{k+1}, i.e. the first parent node is the parent node on the k-th time step and the second parent node is the parent node on the (k+1)-th time step. Then:
1. A_k can connect to A_{k+1} and E_k.
2. E_k can connect to T_k^j. Since entities come in m types, where m is a positive integer not less than 1, T_k^j ranges over T_k^1, T_k^2, ..., T_k^m, and E_k can be connected through one hyperedge to any n of T_k^1, T_k^2, ..., T_k^m, where n is a positive integer not exceeding m. T_{k+1}^1, T_{k+1}^2, ..., T_{k+1}^m are defined identically to T_k^1, T_k^2, ..., T_k^m.
3. T_k^j can connect to: (1) U_k^j, indicating that the entity of type j at position k has unit length; (2) B_k^j, indicating that an entity of type j exists at position k and will proceed to the next position; (3) both U_k^j and B_k^j, indicating that cases (1) and (2) both occur at position k, i.e. two entities of the same type are in a nested relation there.
4. B_k^j can connect to: (1) I_{k+1}^j, indicating that the entity of type j that begins at position k continues at position k+1; (2) L_{k+1}^j, indicating that the entity of type j that begins at position k ends at position k+1; (3) both I_{k+1}^j and L_{k+1}^j, indicating that cases (1) and (2) both occur at position k.
5. U_k^j can only connect to the X node, because a unit-length entity begins and ends at the same position.
6. I_k^j can connect to: (1) I_{k+1}^j, indicating that an entity of type j covers positions k and k+1; (2) L_{k+1}^j, indicating that an entity of type j covers position k and ends at position k+1; (3) both I_{k+1}^j and L_{k+1}^j, indicating that cases (1) and (2) both occur at position k.
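The connection rules 1-6 above can be summarized as an adjacency table over the node symbols. The encoding below is a hypothetical sketch, not the patent's implementation: "j" stands for an entity type index, and "+1" marks a node on the next time step.

```python
# Hypothetical adjacency table for the hypergraph connection rules 1-6.
# Each key is a node symbol; the value lists the child symbols it may
# connect to (a rule allowing "either or both" is modeled by listing both).
HYPERGRAPH_EDGES = {
    "A":   ["A+1", "E"],         # rule 1
    "E":   ["T^j"],              # rule 2: a hyperedge to any subset of T^1..T^m
    "T^j": ["U^j", "B^j"],       # rule 3: unit length, continuation, or both (nesting)
    "B^j": ["I+1^j", "L+1^j"],   # rule 4: continue at k+1, end at k+1, or both
    "U^j": ["X"],                # rule 5: unit-length entities terminate immediately
    "I^j": ["I+1^j", "L+1^j"],   # rule 6: continue, end, or both
}

def allowed(parent, child):
    """True if the connection rules permit an edge from parent to child."""
    return child in HYPERGRAPH_EDGES.get(parent, [])

print(allowed("U^j", "X"), allowed("U^j", "I+1^j"))
```

A full annotation path graph for a corpus is then any subgraph of the fully connected model whose every edge satisfies this table.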
The hypergraph model in the embodiment of the present invention is a fully connected hypergraph model. A fully connected hypergraph model as shown in Fig. 4 can be established according to the above node connection rules; it contains all possible connections between the nodes covered by those rules, where edges drawn in the same line style denote the same hyperedge. It should be noted that for different parent nodes A, each type of child node expresses the same meaning; the subscripts k and k+1 merely distinguish two different nodes of the same type. For example, I_k^j and I_{k+1}^j are both child nodes of the fourth type, and connecting I_k^j to I_{k+1}^j amounts to connecting two child nodes of the fourth type to each other. The subscripts k and k+1 are therefore omitted in Fig. 4.
In the fully connected hypergraph model shown in Fig. 4, the graph formed by each parent node and its corresponding child nodes is a special hypergraph (a subgraph of the fully connected hypergraph model), in which the multiple child nodes can form an ordered child list. The hypergraph is turned into a conditional probability model of the possible output sequences s corresponding to an input sequence x:
p(s | x) = exp(W · G(x, s)) / Z(x)    (1)
where x may be a word sequence and s the sequence formed by the annotation labels of the words, G(x, s) is the feature function, W is the weight vector, and Z(x) is the normalization factor over all possible s for x. To find the optimal segmentation algorithm, let a_j denote the optimal segmentation end point of the j-th input, and let (m, y) denote having label y at the m-th position; then a_j can be computed recursively as
a_j = max_y [ψ(j−1, y) + a_{j−1}]    (2)
where ψ(j−1, y) is the feature value defined on the edge s = (j−1, y). Through the above equation, the optimal segmentation sequence in the hypergraph model, i.e. the segmentation sequence with the maximum feature value, can be found rapidly.
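The recursion a_j = max_y [ψ(j−1, y) + a_{j−1}] of equation (2) is a small dynamic program. The sketch below uses a toy ψ table, where psi[j-1] holds the edge feature values {label: score} entering position j; the numbers are illustrative, not from the patent.

```python
# Dynamic-programming sketch of equation (2): the best score a_j of any
# segmentation ending at position j is the best edge into j plus a_{j-1}.
def best_segmentation_scores(psi):
    """psi: list over positions of {label: edge feature value}. Returns a_1..a_n."""
    a = [0.0]  # a_0 = 0: empty prefix
    for j in range(1, len(psi) + 1):
        a.append(max(psi[j - 1].values()) + a[j - 1])
    return a[1:]

psi = [{"PER": 1.0, "O": 0.2}, {"PER": 0.1, "O": 0.5}]
print(best_segmentation_scores(psi))  # -> [1.0, 1.5]
```

Keeping the argmax label at each step alongside a_j would recover the maximum-feature-value segmentation sequence itself, Viterbi-style.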
S303: determine, according to the annotation information and the entity annotation rule, the annotation path graph corresponding to each annotated corpus.
In specific implementation, each word in an annotated corpus is annotated to obtain the unique annotation path of that word, and the hypergraph formed by combining the annotation paths of the words serves as the annotation path graph corresponding to the corpus, where the annotation path graph is a subgraph of the fully connected hypergraph model.
For example, as shown in Fig. 5, suppose there are two entity types in total, the first being place entities and the second person entities. Then in the annotated corpus "He talked to the U.S.president", it is determined from the annotation information that "He" is a person entity, which can be annotated with a U^2 node (the sixth type of child node above); "the U.S.president" is a person entity, which can be annotated with multiple I^2 nodes (the fourth type of child node above) and a B^2 node (the third type of child node above); meanwhile, within "the U.S.president", "U.S." is a place entity, which can be annotated with a U^1 node. The remaining words are not entities, so their parent nodes and related child nodes can be connected directly to the X node, indicating that no entity exists there.
S304: establish a model to be trained according to the hypergraph model and the preset neural network model.
In specific implementation, as shown in Fig. 6, the hypergraph model and the neural network model can serve as two logical layers of the model to be trained, establishing a neural-network-layer/hypergraph-layer structure for the model to be trained. The preset neural network model may be, but is not limited to, a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BiLSTM).
S305: input the annotation path graphs into the model to be trained to obtain an entity recognition model.
In specific implementation, the hypergraph model includes multiple annotation paths for each annotated corpus, namely all possible annotation paths of that corpus. The annotation path graph corresponding to each annotated corpus contains the target annotation path among those paths, which is the most reasonable annotation path for the corpus. In addition, the model to be trained includes multiple training parameters, which may be, but are not limited to, the weight coefficient corresponding to each edge in the hypergraph model and/or a state transition matrix. The multiple training parameters may first be initialized arbitrarily; then the score of each annotation path is calculated according to the hypergraph model and the neural network model, and the multiple training parameters are adjusted according to the scores until the target annotation path has the highest score among the multiple annotation paths.
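The training idea of S305, scoring every candidate path and adjusting parameters until the target path scores highest, can be sketched in a structured-perceptron style. The feature map and update rule below are illustrative stand-ins, not the patent's actual procedure.

```python
import numpy as np

def path_features(path, n_labels):
    """Toy feature vector: label counts plus label-bigram (transition) counts."""
    f = np.zeros(n_labels + n_labels * n_labels)
    for y in path:
        f[y] += 1.0
    for a, b in zip(path, path[1:]):
        f[n_labels + a * n_labels + b] += 1.0
    return f

def train(candidates, gold, n_labels, epochs=20, lr=0.1):
    """Nudge weights until the gold (target) path outscores all rivals."""
    w = np.zeros(n_labels + n_labels * n_labels)
    for _ in range(epochs):
        scores = [w @ path_features(p, n_labels) for p in candidates]
        best = candidates[int(np.argmax(scores))]
        if best == gold:  # target annotation path already has the top score
            break
        w += lr * (path_features(gold, n_labels) - path_features(best, n_labels))
    return w

# Three candidate annotation paths over two labels; the second is the target.
candidates = [[0, 1, 0], [1, 1, 0], [0, 0, 0]]
gold = candidates[1]
w = train(candidates, gold, n_labels=2)
final = [float(w @ path_features(p, 2)) for p in candidates]
print(int(np.argmax(final)))  # -> 1: the target path now scores highest
```

The patent's model instead scores paths with the combined neural network and hypergraph features described below, but the stopping criterion, training until the target path's score is maximal, is the same.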
S306: identify, according to the entity recognition model, at least one named entity in an input corpus.
In specific implementation, the input corpus can be fed into the entity recognition model, which then outputs an annotation path graph for the input corpus similar to Fig. 5. The annotation path of the input corpus can thus be determined first; the annotation labels corresponding to the input corpus, including the annotation label of each word in it, are then determined from the annotation path; and at least one named entity is identified from the annotation labels. For example, if the annotation label of "U.S." in the input corpus is GPE, "U.S." is determined to be a place entity.
Optionally, after at least one named entity in the input corpus has been identified according to the entity recognition model, a selection instruction input by the user may be received. The selection instruction carries entity type information, and the named entities matching the entity type information are then extracted from the at least one named entity. For example, if the entity type information carried in the selection instruction is PER, all person entities in the input corpus can be extracted.
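The label-to-entity step and the optional type filter can be sketched as simple post-processing. This is a hypothetical helper, not the patent's implementation; it works on a flat (outermost) label sequence, so the nested GPE inside the PER span is not shown here.

```python
# Collect contiguous spans of the same label from per-word annotation labels,
# and optionally keep only the type named in a selection instruction.
def extract_entities(tokens, labels, wanted=None):
    spans, i, n = [], 0, len(tokens)
    while i < n:
        if labels[i] == "O":
            i += 1
            continue
        j = i
        while j < n and labels[j] == labels[i]:  # extend the current span
            j += 1
        spans.append((" ".join(tokens[i:j]), labels[i]))
        i = j
    if wanted is not None:  # entity type filter from the selection instruction
        spans = [s for s in spans if s[1] == wanted]
    return spans

tokens = ["He", "talked", "to", "the", "U.S.", "president"]
labels = ["PER", "O", "O", "PER", "PER", "PER"]  # outermost labels only
print(extract_entities(tokens, labels, wanted="PER"))
```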
In the embodiments of the present invention, a plurality of annotated corpora are first obtained, each of which carries annotation information; a hypergraph model is then established according to a preset entity annotation rule; next, the annotation path graph corresponding to each annotated corpus is determined according to the annotation information and the entity annotation rule, and a model to be trained is established according to the hypergraph model and a preset neural network model; finally, the annotation path graphs are input into the model to be trained to obtain an entity recognition model, and at least one named entity in an input corpus is identified according to the entity recognition model. Entities with nested structure can thus be effectively recognized, improving the accuracy of entity recognition and entity extraction.
Referring to Fig. 7, Fig. 7 is a schematic flowchart of another entity recognition method provided by an embodiment of the present invention. The method includes, but is not limited to, the following steps:
S701: obtain a plurality of annotated corpora, each of which carries annotation information. This step is identical to S301 in the previous embodiment and is not repeated here.
S702: establish a hypergraph model according to a preset entity annotation rule. This step is identical to S302 in the previous embodiment and is not repeated here.
S703: determine, according to the annotation information and the entity annotation rule, the annotation path graph corresponding to each annotated corpus. This step is identical to S303 in the previous embodiment and is not repeated here.
S704: establish a model to be trained according to the hypergraph model and the preset neural network model. This step is identical to S304 in the previous embodiment and is not repeated here.
S705: input the annotation path graphs into the model to be trained.
In specific implementation, the hypergraph model includes multiple annotation paths for each annotated corpus, namely all possible annotation paths of that corpus. The annotation path graph corresponding to each annotated corpus contains the target annotation path among those paths, which is the most reasonable annotation path for the corpus. In addition, the model to be trained includes multiple training parameters, which may be, but are not limited to, the weight coefficient corresponding to each edge in the hypergraph model and/or a state transition matrix; the multiple training parameters may be initialized arbitrarily before the model to be trained is trained.
First, the first feature score of each of the multiple annotation paths may be determined according to the neural network model, and the second feature score of each annotation path determined according to the hypergraph model; the feature scores may be calculated by, but are not limited to being calculated by, combining the forward-backward algorithm with the expectation maximization (Expectation Maximization, EM) algorithm. The sum of the first feature score and the second feature score then serves as the score of each annotation path, which can be used to measure the reasonableness of that path.
To improve the accuracy of the annotation path scores, the feature scores can be calculated over different feature dimensions. The neural network model corresponds to at least one first corpus feature and the hypergraph model corresponds to at least one second corpus feature; therefore a first feature component value of each annotation path may first be determined for each kind of first corpus feature, and a second feature component value of each annotation path determined for each kind of second corpus feature; the sum of the first feature component values then serves as the first feature score, and the sum of the second feature component values serves as the second feature score. The first corpus feature may be the contextual feature of the annotated corpus captured in the forward and backward directions; the second corpus feature may include, but is not limited to, at least one of: a state transition feature, a word feature (the window size may be 3), a language pattern feature (n-gram feature), a part-of-speech label feature (the window size may be 3), a bag-of-words feature (the window size may be 5), and a word shape feature. The n-gram features may include word n-gram features and part-of-speech n-gram features, where n may be 2, 3, or 4. Word shapes may include at least one of: all capitals, all digits, all alphanumeric, contains a digit, contains a dot, contains a hyphen, initial capital, lone initial, punctuation mark, Roman numeral, single character, and URL. The state transition feature describes the probability (alternatively, score) that a word in the annotated corpus transitions from one type of entity to another type.
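The word-shape ("word mode") features listed above can be computed with simple string tests. The function below is an illustrative sketch along those lines; the exact feature inventory and naming in the patent may differ.

```python
import re

# Illustrative word-shape features: each test mirrors one of the word modes
# listed in the text (all capitals, all digits, initial capital, ...).
def word_shape(token):
    shapes = []
    if token.isupper():
        shapes.append("ALL_CAPS")
    if token.isdigit():
        shapes.append("ALL_DIGITS")
    if token[:1].isupper():
        shapes.append("INIT_CAP")
    if "-" in token:
        shapes.append("CONTAINS_HYPHEN")
    if "." in token:
        shapes.append("CONTAINS_DOT")
    if re.fullmatch(r"[IVXLCDM]+", token):
        shapes.append("ROMAN_NUMERAL")
    if len(token) == 1:
        shapes.append("SINGLE_CHAR")
    return shapes or ["OTHER"]

print(word_shape("U.S."))  # -> ['ALL_CAPS', 'INIT_CAP', 'CONTAINS_DOT']
```

Such shape features are useful precisely for tokens like "U.S.", whose surface form already hints at an entity.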
For example, let the first corpus feature be the contextual feature and the second corpus feature be the state transition feature. For an annotated corpus [x]_T, let [f_θ]_{i,t} denote the score, calculated according to the neural network model, that the t-th word in [x]_T is annotated as an entity of the i-th type, where [x]_T contains T words in total; the scores of all the words thus form a feature matrix [f_θ], where θ denotes the setting parameters of the neural network. The state transition feature matrix used in the hypergraph model is [A], where [A]_{i,j} denotes the score of transitioning from the i-th state to the j-th; it should be noted here that the same transition feature matrix is used for every word in the corpus. Finally, the feature score s([x]_T, [i]_T, θ) of the annotation path corresponding to the label sequence [i]_T of the annotated corpus [x]_T can be calculated by formula (3):
s([x]_T, [i]_T, θ) = Σ_{t=1}^{T} ([A]_{[i]_{t−1}, [i]_t} + [f_θ]_{[i]_t, t})    (3)
When the first corpus feature and the second corpus feature each include several kinds, the feature score matrices corresponding to the various corpus features are added up in turn according to formula (3). The relationship between a label sequence and an annotation path can be illustrated with the annotated corpus "He talked to the U.S.president": the annotation labels of "He", "talked", "to", "the U.S.president", and "U.S." in this corpus are PER, O, O, PER, and GPE respectively, so the label sequence is PER, O, O, PER, GPE, and the annotation path graph corresponding to this label sequence is as shown in Fig. 5.
As another example, suppose an annotated corpus X contains 4 words x_1, x_2, x_3, and x_4, and that there are 3 preset entity types a_1, a_2, and a_3 in total. The feature matrix obtained according to the neural network model is W, in which the element in row i and column j denotes the score that x_i belongs to the j-th type of entity; the feature score of each annotation path can then be obtained from W. For example, suppose the label sequence corresponding to one possible annotation path ω of X is a_1, a_3, a_2, a_1, and that from W the score of x_1 belonging to a_1 is 1.5, the score of x_2 belonging to a_3 is 0.11, the score of x_3 belonging to a_2 is 0.002, and the score of x_4 belonging to a_1 is 0.12; the first feature score of annotation path ω is therefore 1.5 + 0.11 + 0.002 + 0.12 = 1.732. The state transition matrix of the hypergraph model is Q, in which the element in row m and column n denotes the probability (score) of transitioning from the m-th type to the n-th type. For the above annotation path ω, the score of x_1 transitioning from type a_1 to type a_3 is 0.1, the score of x_2 transitioning from type a_3 to type a_2 is 0, and the score of x_3 transitioning from type a_2 to type a_1 is 0.008; the second feature score of annotation path ω is therefore 0.1 + 0 + 0.008 = 0.108, and the score of annotation path ω is 1.732 + 0.108 = 1.84.
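The worked example above can be reproduced numerically. Summing the component scores given in the text yields a first feature score of 1.732 and a second of 0.108, for a total of 1.84. In the sketch below, only the entries quoted in the text are filled in; all other entries of W and Q are set to zero for illustration.

```python
import numpy as np

# W[t, j]: neural network score that word x_{t+1} belongs to type a_{j+1}.
W = np.zeros((4, 3))
W[0, 0] = 1.5    # x1 belongs to a1
W[1, 2] = 0.11   # x2 belongs to a3
W[2, 1] = 0.002  # x3 belongs to a2
W[3, 0] = 0.12   # x4 belongs to a1

# Q[m, n]: hypergraph-model transition score from type a_{m+1} to a_{n+1}.
Q = np.zeros((3, 3))
Q[0, 2] = 0.1    # a1 -> a3
Q[2, 1] = 0.0    # a3 -> a2
Q[1, 0] = 0.008  # a2 -> a1

path = [0, 2, 1, 0]  # label sequence a1, a3, a2, a1 (zero-indexed)
first = sum(W[t, y] for t, y in enumerate(path))        # first feature score
second = sum(Q[a, b] for a, b in zip(path, path[1:]))   # second feature score
print(round(float(first), 3), round(float(second), 3), round(float(first + second), 2))
```

This is exactly the per-path accumulation of formula-style emission and transition scores; during training, these path scores are what the parameter updates act on.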
Then, the multiple training parameters are adjusted according to the score of each labeled path so as to increase the score of the target labeled path.
Compared with prior-art entity recognition models based on a hypergraph model alone, the entity recognition model trained in the embodiment of the present invention additionally computes a neural-network feature matrix on top of the traditional feature matrices of the labeled paths obtained with the hypergraph model (such as the state-transition matrix and the n-gram feature matrix), and uses them jointly to compute the feature scores of the labeled paths, improving the accuracy with which the plausibility of a labeled path is judged. In addition, the embodiment of the present invention proposes a special hypergraph model for nested entities, which effectively solves the prior-art problem that nested entities cannot be identified.
S706: determine whether the score of the target labeled path corresponding to the labeled path graph included in the hypergraph model is the top score among the plurality of labeled paths. If so, execute S707; otherwise, continue executing S705.
In a specific implementation, the obtained labeled corpora may be divided into multiple batches. For example, if 1000 labeled corpora are obtained, they may be divided into 4 batches of 250 each. One batch is first input into the model to be trained for training; if, after training completes, the score of the target labeled path is not the top score among the plurality of labeled paths, training continues with another batch, so that the training parameters of the model to be trained keep being adjusted until the score of the target labeled path is the top score among the plurality of labeled paths.
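The batched training procedure above can be sketched as follows. The model object, its `update` method, the `target_path_is_top_scoring` stopping check and the epoch limit are all hypothetical placeholders; only the batching arithmetic comes from the text.

```python
def split_into_batches(corpora, num_batches):
    """Split the labeled corpora into equally sized batches."""
    size = len(corpora) // num_batches
    return [corpora[i * size:(i + 1) * size] for i in range(num_batches)]

def train(corpora, model, max_epochs=10):
    """Feed one batch at a time until the target path scores highest (S706/S707)."""
    batches = split_into_batches(corpora, 4)
    for _ in range(max_epochs):
        for batch in batches:
            model.update(batch)                     # adjust the training parameters
            if model.target_path_is_top_scoring():  # S706 check
                return model                        # S707: parameters are fixed
    return model

corpora = [f"corpus-{i}" for i in range(1000)]  # e.g. 1000 labeled corpora
batches = split_into_batches(corpora, 4)        # 4 batches of 250 each
```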
S707: use the training parameters corresponding to the top score as the set parameters of the model to be trained to obtain the entity recognition model.
In a specific implementation, when the score of the target labeled path is the highest among the plurality of labeled paths, the probability of the target labeled path occurring in the fully-connected hypergraph model is the largest. The relevant parameters of the model to be trained can then be fixed as the set parameters, and the resulting model is used as the entity recognition model.
S708: identify at least one named entity in an input corpus according to the entity recognition model. This step is identical to S306 in the previous embodiment and is not repeated here.
In the embodiment of the present invention, a plurality of labeled corpora are obtained first, each labeled corpus carrying labeling information; a hypergraph model is then established according to a preset entity labeling rule; next, the labeled path graph corresponding to each labeled corpus is determined according to the labeling information and the entity labeling rule, and a model to be trained is established according to the hypergraph model and a preset neural network model; finally, the labeled path graphs are input into the model to be trained for training to obtain an entity recognition model, and at least one named entity in an input corpus is identified according to the entity recognition model. Entities with nested structure can thus be identified effectively, improving the accuracy of entity recognition and entity extraction.
The method of the embodiment of the present invention has been described above; the related devices of the embodiment of the present invention are described below.
Referring to Fig. 8, Fig. 8 is a schematic structural diagram of an entity recognition device provided by an embodiment of the present invention. The entity recognition device may include:
An obtaining module 801, configured to obtain a plurality of labeled corpora, each of which carries labeling information.
In a specific implementation, a labeled corpus can be a sentence of any length, such as "He talked to the U.S.president" or "Zhang San is an employee of company XXX". The labeling information of each labeled corpus includes the label of every word in that corpus. From the labels it can be determined whether the corpus contains named entities and, if so, of which types; for brevity of description, named entities are simply referred to as entities below. Entity types may include, but are not limited to, person entities and place-name entities, and labels may be numbers, characters or character strings; for example, the label of a person entity may be, but is not limited to, PER, and the label of a place-name entity may be, but is not limited to, GPE.
A modeling module 802, configured to establish a hypergraph model according to a preset entity labeling rule.
In a specific implementation, a hypergraph model is a generalization of the traditional graph model: each edge in a hypergraph can connect two or more nodes, and an edge connecting more than two nodes is commonly called a hyperedge. In a traditional graph model, each edge can connect at most two nodes, and in a traditional conditional random field graph model each edge can connect only one node.
In order to label nested entities, the embodiment of the present invention proposes a special hypergraph model containing multiple parent nodes, each of which corresponds to child nodes of multiple types. Each parent node corresponds to the input word of one time step: for a sentence containing N words, one word can be input at each time step, and the word input at a time step is the input word of that time step; a time step can be regarded as a period of preset length, such as 0.1 millisecond. Each parent node may correspond to, but is not limited to, child nodes of seven types. Using A_k to denote the parent node corresponding to the k-th time step:

A_k denotes an entity starting at position k or a later position, where position k is understood here and below as the position, in its corpus, of the input word of the k-th time step;

the child node E_k of the first type of A_k denotes an entity whose left boundary is at position k;

the child node T_k^j of the second type of A_k denotes an entity whose left boundary is at position k and whose type is j;

the child node B_k^j of the third type of A_k denotes an entity of type j starting at position k;

the child node I_k^j of the fourth type of A_k denotes an entity of type j covering position k;

the child node L_k^j of the fifth type of A_k denotes an entity of type j ending at position k;

the child node U_k^j of the sixth type of A_k denotes an entity of type j at position k with unit length, where unit length means that the entity begins and ends at the same position;

the child node X of the seventh type of A_k denotes the end of an entity.
In a specific implementation, the length of each obtained labeled corpus can be counted first, where the length of a labeled corpus equals the number of words it contains. For example, "He talked to the U.S.president" contains 6 words, so its length is 6, and "Zhang San is an employee of company XXX" contains 7 words, so its length is 7. The number of parent nodes in the hypergraph model, i.e. the temporal extent of the hypergraph model, can then be determined from, but is not limited to, the maximum length; for example, if the maximum length is 10, the number of parent nodes is determined to be 10. The child nodes of the multiple types are then connected, and the multiple parent nodes are connected, according to the preset entity labeling rule. The multiple parent nodes include a first parent node and a second parent node, which are adjacent in time step: if the first parent node is A_k, i.e. the parent node of the k-th time step, the second parent node is A_{k+1}, the parent node of the (k+1)-th time step. Then:
1. A_k can connect A_{k+1} and E_k.

2. E_k can connect to T_k^j. Since there are m entity types, where m is a positive integer not less than 1, T_k^j includes T_k^1, T_k^2, …, T_k^m, and E_k can connect, through one hyperedge, to any n of T_k^1, T_k^2, …, T_k^m, where n is a positive integer not exceeding m. T_{k+1}^1, T_{k+1}^2, …, T_{k+1}^m are defined in the same way as T_k^1, T_k^2, …, T_k^m.

3. T_k^j can connect to: (1) U_k^j, denoting that the entity of type j at position k has unit length; (2) B_k^j, denoting that the entity of type j at position k continues to the next position; (3) both U_k^j and B_k^j, denoting that cases (1) and (2) both occur at position k, i.e. two entities of the same type are nested.

4. B_k^j can connect to: (1) I_{k+1}^j, denoting that the entity of type j starting at position k continues at position k+1; (2) L_{k+1}^j, denoting that the entity of type j starting at position k ends at position k+1; (3) both I_{k+1}^j and L_{k+1}^j, denoting that cases (1) and (2) both occur at position k.

5. U_k^j can connect only to the X node, because an entity of unit length begins and ends at the same position.

6. I_k^j can connect to: (1) I_{k+1}^j, denoting that an entity of type j covers positions k and k+1; (2) L_{k+1}^j, denoting that an entity of type j covers position k and ends at position k+1; (3) both I_{k+1}^j and L_{k+1}^j, denoting that cases (1) and (2) both occur at position k.
The hypergraph model in the embodiment of the present invention is a fully-connected hypergraph model. The fully-connected hypergraph model shown in Fig. 4 can be established according to the above node connection rules; it contains all possible connections between the nodes covered by those rules, and identical line styles indicate the same hyperedge.
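The connection rules 1–6 above can be sketched as an edge-construction routine for a single position k. The node names (E, T, B, I, L, U, X) follow the child-node types as they are used in the Fig. 5 example, and the `(source, targets)` tuple standing in for a hyperedge is purely illustrative; for rule 2, only the hyperedge to the full set of type nodes is generated, although any subset is allowed.

```python
def edges_at_position(k, num_types):
    """Hyperedges leaving position k in the fully-connected hypergraph."""
    edges = []
    edges.append((f"A{k}", (f"A{k+1}", f"E{k}")))                   # rule 1
    # rule 2: one hyperedge from E_k to the type nodes (full set shown)
    edges.append((f"E{k}", tuple(f"T{k}^{j}" for j in range(1, num_types + 1))))
    for j in range(1, num_types + 1):
        edges.append((f"T{k}^{j}", (f"U{k}^{j}",)))                 # rule 3(1)
        edges.append((f"T{k}^{j}", (f"B{k}^{j}",)))                 # rule 3(2)
        edges.append((f"T{k}^{j}", (f"U{k}^{j}", f"B{k}^{j}")))     # rule 3(3)
        edges.append((f"B{k}^{j}", (f"I{k+1}^{j}",)))               # rule 4(1)
        edges.append((f"B{k}^{j}", (f"L{k+1}^{j}",)))               # rule 4(2)
        edges.append((f"B{k}^{j}", (f"I{k+1}^{j}", f"L{k+1}^{j}"))) # rule 4(3)
        edges.append((f"U{k}^{j}", ("X",)))                         # rule 5
        edges.append((f"I{k}^{j}", (f"I{k+1}^{j}",)))               # rule 6(1)
        edges.append((f"I{k}^{j}", (f"L{k+1}^{j}",)))               # rule 6(2)
        edges.append((f"I{k}^{j}", (f"I{k+1}^{j}", f"L{k+1}^{j}"))) # rule 6(3)
    return edges

edges = edges_at_position(1, num_types=2)  # 2 + 2 * 10 = 22 hyperedges
```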
Labeling module 803 determines every mark language for marking rule according to the markup information and the entity Expect corresponding mark path profile.
In a specific implementation, each word in a labeled corpus is labeled to obtain its unique labeled path, and the labeled paths of the words combine into a hypergraph that serves as the labeled path graph of that corpus; the labeled path graph is a subgraph of the fully-connected hypergraph model. For example, as shown in Fig. 5, suppose there are two entity types in total, place-name entities and person entities. In the labeled corpus "He talked to the U.S.president", according to the labeling information, "He" is a person entity and can be labeled with a U2 node (a child node of the sixth type above); "the U.S.president" is also a person entity and can be labeled with several I2 nodes (child nodes of the fourth type above) and a B2 node (a child node of the third type above); within "the U.S.president", "U.S." is a place-name entity and can be labeled with a U1 node. The remaining words are not entities, so their parent nodes and related child nodes can be connected directly to the X node, indicating that no entity exists there.
The modeling module 802 is further configured to establish a model to be trained according to the hypergraph model and a preset neural network model.
In a specific implementation, as shown in Fig. 6, the hypergraph model and the neural network model can serve as two logical layers of the model to be trained, establishing a neural-network-layer / hypergraph-layer structure for the model to be trained. The preset neural network model may be, but is not limited to, a BiLSTM.
A training module 804, configured to input the labeled path graphs into the model to be trained for training to obtain an entity recognition model.
In a specific implementation, the hypergraph model includes the multiple labeled paths of each labeled corpus, namely all possible labeled paths of that corpus. The labeled path graph corresponding to each labeled corpus includes the target labeled path among the multiple labeled paths, i.e. the most plausible labeled path of that corpus. In addition, the model to be trained includes multiple training parameters, which may be, but are not limited to, the weight coefficients of the edges in the hypergraph model and/or the state-transition matrix; before the model to be trained is trained, the multiple training parameters can be initialized arbitrarily.
First, the first feature score of each of the multiple labeled paths can be determined according to the neural network model, and the second feature score of each labeled path can be determined according to the hypergraph model; the feature scores may be, but are not limited to being, computed by combining the forward-backward algorithm with the expectation-maximization (EM) algorithm. The sum of the first feature score and the second feature score is then taken as the score of each labeled path, which can be used to measure the plausibility of that labeled path.
To improve the accuracy of the labeled-path scores, the feature scores can be computed over several feature dimensions: the hypergraph model corresponds to at least one first corpus feature, and the neural network model corresponds to at least one second corpus feature. Therefore, a first feature component value of each labeled path can be determined for every first corpus feature of the at least one first corpus feature, and a second feature component value of each labeled path can be determined for every second corpus feature of the at least one second corpus feature; the sum of the first feature component values is taken as the first feature score and the sum of the second feature component values as the second feature score. The first corpus feature can capture the contextual features of the labeled corpus through the forward-backward algorithm, and the second corpus feature may include, but is not limited to, at least one of: a state-transition feature, a word feature (window size may be 3), a language-pattern feature (n-gram feature), a part-of-speech tag feature (window size may be 3), a bag-of-words feature (window size may be 5) and a word-shape feature. The n-gram features may include word n-gram features and part-of-speech n-gram features, with n being 2, 3 or 4. Word shapes may include at least one of: all capitals, all digits, all alphanumeric, containing a digit, containing a dot, containing a hyphen, initial capital, lone initial, punctuation mark, Roman numeral, single character and URL. The state-transition feature describes the probability (or score) that a word in the labeled corpus transitions from one type of entity to another type of entity.
Then, the multiple training parameters are adjusted according to the score of each labeled path so that the score of the target labeled path becomes the top score among the multiple labeled paths, and the training parameters corresponding to the top score are used as the set parameters of the model to be trained to obtain the entity recognition model. The obtained labeled corpora may be divided into multiple batches; for example, if 1000 labeled corpora are obtained, they may be divided into 4 batches of 250 each. One batch is first input into the model to be trained for training; if, after training completes, the score of the target labeled path is not the top score among the multiple labeled paths, training continues with another batch, so that the training parameters of the model to be trained keep being adjusted until the score of the target labeled path is the top score among the multiple labeled paths.
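How first- and second-feature scores combine, and how the top-scoring path is selected, can be illustrated by brute-force enumeration over a tiny corpus. The patent computes feature scores by combining the forward-backward and EM algorithms; this exhaustive version is only a sketch under that simplification, and all the numbers are made up.

```python
from itertools import product

def best_path(W, Q):
    """Return (score, path) of the top-scoring label sequence."""
    num_words, num_types = len(W), len(W[0])
    best = (float("-inf"), None)
    for path in product(range(num_types), repeat=num_words):
        first = sum(W[t][y] for t, y in enumerate(path))       # neural-network features
        second = sum(Q[a][b] for a, b in zip(path, path[1:]))  # transition features
        best = max(best, (first + second, path))
    return best

W = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]  # emission scores: 3 words x 2 types
Q = [[0.5, 0.0], [0.0, 0.5]]              # transitions favour staying in type
score, path = best_path(W, Q)             # path (0, 0, 0) wins with 2.8
```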
An identification module 805, configured to identify at least one named entity in an input corpus according to the entity recognition model.
In a specific implementation, the input corpus can be fed into the entity recognition model, which outputs a labeled path graph of the input corpus similar to Fig. 5. The labeled path of the input corpus is thus determined first; the labels corresponding to the input corpus, including the label of every character/word in it, are then determined from the labeled path; finally, at least one named entity is identified from the labels. For example, if the label of "U.S." in the input corpus is GPE, then "U.S." is determined to be a place-name entity.
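The final label-to-entity step can be sketched as simple post-processing. The word spans and labels follow the "He talked to the U.S.president" example from the text; the recognition model itself is not included, and the helper name is illustrative.

```python
def extract_entities(spans, labels):
    """Pair each word span with its label, keeping only entity labels."""
    return [(w, t) for w, t in zip(spans, labels) if t != "O"]

spans = ["He", "talked", "to", "the U.S.president", "U.S."]
labels = ["PER", "O", "O", "PER", "GPE"]  # "U.S." is nested inside the PER span
entities = extract_entities(spans, labels)
# entities: [("He", "PER"), ("the U.S.president", "PER"), ("U.S.", "GPE")]
```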
Optionally, the device described in the embodiment of the present invention further includes an extraction module, which can be configured to, after at least one named entity in the input corpus is identified according to the entity recognition model, receive a selection instruction input by a user, the selection instruction carrying entity-type information, and then extract, from the at least one named entity, the named entities matching the entity-type information. For example, if the entity-type information carried in the selection instruction is PER, all person entities in the input corpus can be extracted.
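The extraction module's filtering step can be sketched as follows; the function name and the `(text, type)` entity representation are assumptions for illustration.

```python
def select_entities(entities, entity_type):
    """Return the named entities whose type matches the selection instruction."""
    return [text for text, t in entities if t == entity_type]

entities = [("He", "PER"), ("the U.S.president", "PER"), ("U.S.", "GPE")]
persons = select_entities(entities, "PER")  # selection instruction carries PER
```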
In the embodiment of the present invention, a plurality of labeled corpora are obtained first, each labeled corpus carrying labeling information; a hypergraph model is then established according to a preset entity labeling rule; next, the labeled path graph corresponding to each labeled corpus is determined according to the labeling information and the entity labeling rule, and a model to be trained is established according to the hypergraph model and a preset neural network model; finally, the labeled path graphs are input into the model to be trained for training to obtain an entity recognition model, and at least one named entity in an input corpus is identified according to the entity recognition model. Entities with nested structure can thus be identified effectively, improving the accuracy of entity recognition and entity extraction.
Referring to Fig. 9, Fig. 9 is a schematic structural diagram of an entity recognition apparatus provided by an embodiment of the present invention. As shown, the entity recognition apparatus may include: at least one processor 901, at least one communication interface 902, at least one memory 903 and at least one communication bus 904.
The processor 901 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules and circuits described in this disclosure. The processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The communication bus 904 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, etc.; for ease of representation it is drawn as a single thick line in Fig. 9, which does not mean there is only one bus or only one type of bus. The communication bus 904 is used to implement connection and communication between these components. The communication interface 902 of the apparatus in the embodiment of the present invention is used for signaling or data communication with other node devices. The memory 903 may include volatile memory, such as nonvolatile random access memory (NVRAM), phase-change RAM (PRAM) or magnetoresistive RAM (MRAM), and may also include nonvolatile memory, for example at least one disk memory, an electrically erasable programmable read-only memory (EEPROM), a flash device such as NOR flash memory or NAND flash memory, or a semiconductor device such as a solid-state disk (SSD). The memory 903 may optionally also be at least one storage device located remotely from the aforementioned processor 901. A set of program code is stored in the memory 903, and the processor 901 executes the program in the memory 903 to perform the following:
obtaining a plurality of labeled corpora, each of which carries labeling information;

establishing a hypergraph model according to a preset entity labeling rule;

determining, according to the labeling information and the entity labeling rule, the labeled path graph corresponding to each labeled corpus;

establishing a model to be trained according to the hypergraph model and a preset neural network model;

inputting the labeled path graphs into the model to be trained for training to obtain an entity recognition model;

identifying at least one named entity in an input corpus according to the entity recognition model.
Optionally, the hypergraph model includes multiple parent nodes, each of which corresponds to child nodes of multiple types;
The processor 901 is further configured to perform the following steps:
connecting the child nodes of the multiple types and connecting the multiple parent nodes according to the entity labeling rule to obtain the hypergraph model.
Optionally, the multiple parent nodes include a first parent node and a second parent node;
The processor 901 is further configured to perform the following steps:
connecting the child node of the first type of the first parent node with the child node of the second type of the first parent node; and

connecting the child node of the second type of the first parent node with at least one of the child node of the third type and the child node of the sixth type of the first parent node; and

connecting the child node of the third type of the first parent node with at least one of the child node of the fourth type and the child node of the fifth type of the second parent node; and

connecting the child node of the fourth type of the first parent node with at least one of the child node of the fourth type and the child node of the fifth type of the second parent node; and

connecting the child node of the sixth type of the first parent node and the child node of the fifth type of the first parent node with the child node of the seventh type of the first parent node; and

connecting the first parent node with the second parent node.
Optionally, the model to be trained includes multiple training parameters; the hypergraph model includes the multiple labeled paths of each labeled corpus; the labeled path graph includes the target labeled path among the multiple labeled paths;
Optionally, the processor 901 is further configured to perform the following steps:
determining, according to the neural network model, the first feature score of each of the multiple labeled paths, and determining, according to the hypergraph model, the second feature score of each labeled path;

taking the sum of the first feature score and the second feature score as the score of each labeled path;

adjusting the multiple training parameters according to the score of each labeled path so that the score of the target labeled path becomes the top score among the multiple labeled paths;

using the multiple training parameters corresponding to the top score as the set parameters of the model to be trained to obtain the entity recognition model.
Optionally, the hypergraph model corresponds to at least one first corpus feature, and the neural network model corresponds to at least one second corpus feature;
The processor 901 is further configured to perform the following steps:
determining the first feature component value of each labeled path according to every first corpus feature of the at least one first corpus feature, and determining the second feature component value of each labeled path according to every second corpus feature of the at least one second corpus feature;

taking the sum of the first feature component values as the first feature score and the sum of the second feature component values as the second feature score.
Optionally, the processor 901 is further configured to perform the following steps:
inputting the input corpus into the entity recognition model to obtain the labeled path of the input corpus;

determining the labels corresponding to the input corpus according to the labeled path;

identifying the at least one named entity according to the labels.
Optionally, the processor 901 is further configured to perform the following steps:
receiving a selection instruction input by a user, the selection instruction carrying entity-type information;

extracting, from the at least one named entity, the named entities matching the entity-type information.
Further, the processor can also cooperate with the memory and the communication interface to perform the operations of the entity recognition device in the foregoing embodiments of the invention.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, implementation may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g. a floppy disk, hard disk or magnetic tape), an optical medium (e.g. a DVD) or a semiconductor medium (e.g. a solid-state disk (SSD)).
The specific embodiments described above further describe in detail the objects, technical solutions and beneficial effects of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (9)

1. An entity recognition method, characterized in that the method comprises:

obtaining a plurality of labeled corpora, each of which carries labeling information;

establishing a hypergraph model according to a preset entity labeling rule;

determining, according to the labeling information and the entity labeling rule, the labeled path graph corresponding to each labeled corpus;

establishing a model to be trained according to the hypergraph model and a preset neural network model;

inputting the labeled path graphs into the model to be trained for training to obtain an entity recognition model;

identifying at least one named entity in an input corpus according to the entity recognition model.
2. The method according to claim 1, characterized in that the hypergraph model includes multiple parent nodes, each of which corresponds to child nodes of multiple types;

the establishing a hypergraph model according to a preset entity labeling rule comprises:

connecting the child nodes of the multiple types and connecting the multiple parent nodes according to the entity labeling rule to obtain the hypergraph model.
3. The method according to claim 2, characterized in that the multiple parent nodes include a first parent node and a second parent node;

the connecting the child nodes of the multiple types and connecting the multiple parent nodes according to the entity labeling rule to obtain the hypergraph model comprises:

connecting the child node of the first type of the first parent node with the child node of the second type of the first parent node; and

connecting the child node of the second type of the first parent node with at least one of the child node of the third type and the child node of the sixth type of the first parent node; and

connecting the child node of the third type of the first parent node with at least one of the child node of the fourth type and the child node of the fifth type of the second parent node; and

connecting the child node of the fourth type of the first parent node with at least one of the child node of the fourth type and the child node of the fifth type of the second parent node; and

connecting the child node of the sixth type of the first parent node and the child node of the fifth type of the first parent node with the child node of the seventh type of the first parent node; and

connecting the first parent node with the second parent node.
4. The method according to claim 1, characterized in that the model to be trained includes multiple training parameters; the hypergraph model includes the multiple labeled paths of each labeled corpus; the labeled path graph includes the target labeled path among the multiple labeled paths;

the inputting the labeled path graphs into the model to be trained for training to obtain an entity recognition model comprises:

determining, according to the neural network model, the first feature score of each of the multiple labeled paths, and determining, according to the hypergraph model, the second feature score of each labeled path;

taking the sum of the first feature score and the second feature score as the score of each labeled path;

adjusting the multiple training parameters according to the score of each labeled path so that the score of the target labeled path becomes the top score among the multiple labeled paths;

using the multiple training parameters corresponding to the top score as the set parameters of the model to be trained to obtain the entity recognition model.
5. The method according to claim 4, characterized in that the hypergraph model corresponds to at least one first corpus feature, and the neural network model corresponds to at least one second corpus feature;

the determining, according to the neural network model, the first feature score of each of the multiple labeled paths, and determining, according to the hypergraph model, the second feature score of each labeled path comprises:

determining the first feature component value of each labeled path according to every first corpus feature of the at least one first corpus feature, and determining the second feature component value of each labeled path according to every second corpus feature of the at least one second corpus feature;

taking the sum of the first feature component values as the first feature score and the sum of the second feature component values as the second feature score.
6. The method of claim 5, wherein the at least one second corpus feature comprises at least one of a state transition feature, a word feature, a language pattern feature, a part-of-speech tag feature, a bag-of-words feature, and a word pattern feature.
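Two of the second-corpus-feature types enumerated in claim 6 can be illustrated concretely: a state transition feature as counts of adjacent label pairs along an annotation path, and a bag-of-words feature as unordered word occurrence counts. This is a toy sketch, not the patent's feature definitions.

```python
from collections import Counter

def state_transition_features(labels):
    """Count each adjacent (label -> label) transition in an annotation path."""
    return Counter(zip(labels, labels[1:]))

def bag_of_words_features(tokens):
    """Word occurrence counts, ignoring order."""
    return Counter(tokens)

labels = ["B", "I", "O", "B"]
trans = state_transition_features(labels)  # e.g. ('B', 'I') occurs once
bow = bag_of_words_features(["New", "York", "in", "New", "York"])
```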
7. The method of claim 1, wherein identifying at least one named entity in the input corpus according to the entity recognition model comprises:
inputting the input corpus into the entity recognition model to obtain an annotation path of the input corpus;
determining, according to the annotation path, the annotation labels corresponding to the input corpus; and
identifying the at least one named entity according to the annotation labels.
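The last step of claim 7 — reading named entities off the annotation labels — can be sketched as follows. A BIO label scheme is assumed here purely for illustration; the patent's hypergraph-derived label encoding may differ.

```python
def entities_from_labels(tokens, labels):
    """Collect (entity_text, start, end) spans from BIO-style per-token labels."""
    entities, start = [], None
    for i, label in enumerate(labels + ["O"]):  # trailing sentinel flushes the last span
        if label == "B":
            if start is not None:               # close a span that a new "B" interrupts
                entities.append((" ".join(tokens[start:i]), start, i))
            start = i
        elif label != "I":                      # "O" (or sentinel) closes any open span
            if start is not None:
                entities.append((" ".join(tokens[start:i]), start, i))
            start = None
    return entities

tokens = ["Barack", "Obama", "visited", "New", "York"]
labels = ["B", "I", "O", "B", "I"]
spans = entities_from_labels(tokens, labels)
# → [("Barack Obama", 0, 2), ("New York", 3, 5)]
```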
8. The method of any one of claims 1 to 7, wherein after identifying at least one named entity in the input corpus according to the entity recognition model, the method further comprises:
receiving a selection instruction input by a user, the selection instruction carrying entity type information; and
extracting, from the at least one named entity, a named entity matching the entity type information.
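Claim 8's filtering step amounts to selecting the recognized entities whose type matches the type carried by the user's selection instruction. The typed `(name, type)` pairs below are an assumed representation for the example.

```python
def select_entities(entities, entity_type):
    """Return only the named entities whose type matches the selection instruction."""
    return [name for name, etype in entities if etype == entity_type]

found = [("Barack Obama", "PERSON"), ("New York", "LOCATION")]
people = select_entities(found, "PERSON")  # → ["Barack Obama"]
```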
9. An entity recognition apparatus, wherein the apparatus comprises:
an acquisition module, configured to acquire a plurality of annotation corpora, each annotation corpus of the plurality of annotation corpora carrying annotation information;
a modeling module, configured to establish a hypergraph model according to a preset entity annotation rule;
an annotation module, configured to determine, according to the annotation information and the entity annotation rule, an annotation path graph corresponding to each annotation corpus;
the modeling module being further configured to establish a model to be trained according to the hypergraph model and a preset neural network model;
a training module, configured to input the annotation path graph into the model to be trained for training, to obtain an entity recognition model; and
an identification module, configured to identify at least one named entity in an input corpus according to the entity recognition model.
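The apparatus of claim 9 can be pictured structurally as one class whose methods mirror the claimed modules; the bodies below are stubs that only record the data flow, and all names and return values are illustrative assumptions.

```python
class EntityRecognitionApparatus:
    """Modules of the claimed apparatus as methods; bodies are placeholder stubs."""

    def __init__(self):
        self.calls = []  # records which modules have run, in order

    def acquisition_module(self, corpora):
        """Acquire the annotation corpora, each carrying annotation information."""
        self.calls.append("acquire")
        return corpora

    def modeling_module(self, entity_rule, neural_model=None):
        """Build the hypergraph model, or (with a neural model) the model to be trained."""
        self.calls.append("model")
        return {"rule": entity_rule, "neural": neural_model}

    def annotation_module(self, corpora, entity_rule):
        """Derive an annotation path graph for each annotation corpus."""
        self.calls.append("annotate")
        return [{"corpus": c, "rule": entity_rule} for c in corpora]

    def training_module(self, path_graphs, model):
        """Train on the path graphs to obtain the entity recognition model."""
        self.calls.append("train")
        return {"trained": True, **model}

    def identification_module(self, trained_model, text):
        """Identify named entities in an input corpus; stubbed to return none."""
        self.calls.append("identify")
        return []
```

Invoking the methods in the order the claim lists them traces the pipeline: acquire corpora, build the hypergraph, annotate, build and train the combined model, then identify.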
CN201910158600.3A 2019-03-01 2019-03-01 Entity identification method and related equipment Active CN109902303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910158600.3A CN109902303B (en) 2019-03-01 2019-03-01 Entity identification method and related equipment


Publications (2)

Publication Number Publication Date
CN109902303A true CN109902303A (en) 2019-06-18
CN109902303B CN109902303B (en) 2023-05-26

Family

ID=66946183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910158600.3A Active CN109902303B (en) 2019-03-01 2019-03-01 Entity identification method and related equipment

Country Status (1)

Country Link
CN (1) CN109902303B (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040133536A1 (en) * 2002-12-23 2004-07-08 International Business Machines Corporation Method and structure for template-based data retrieval for hypergraph entity-relation information structures
CN106844947A (en) * 2017-01-18 2017-06-13 清华大学 A kind of locomotive energy saving optimizing automatic Pilot method based on high-order relational learning
CN106951438A (en) * 2017-02-13 2017-07-14 北京航空航天大学 A kind of event extraction system and method towards open field
CN107016012A (en) * 2015-09-11 2017-08-04 谷歌公司 Handle the failure in processing natural language querying
US20180212996A1 (en) * 2017-01-23 2018-07-26 Cisco Technology, Inc. Entity identification for enclave segmentation in a network
CN108874997A (en) * 2018-06-13 2018-11-23 广东外语外贸大学 A kind of name name entity recognition method towards film comment
CN108897805A (en) * 2018-06-15 2018-11-27 江苏大学 A kind of patent text automatic classification method
CN109142317A (en) * 2018-08-29 2019-01-04 厦门大学 A kind of Raman spectrum substance recognition methods based on Random Forest model
CN109191485A (en) * 2018-08-29 2019-01-11 西安交通大学 A kind of more video objects collaboration dividing method based on multilayer hypergraph model
CN109299458A (en) * 2018-09-12 2019-02-01 广州多益网络股份有限公司 Entity recognition method, device, equipment and storage medium
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system


Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
ALDRIAN OBAJA MUIS et al.: "Learning to Recognize Discontiguous Entities", arXiv:1810.08579v1 *
ARZOO KATIYAR et al.: "Nested Named Entity Recognition Revisited", Proceedings of NAACL-HLT 2018 *
BAILIN WANG et al.: "Neural Segmental Hypergraphs for Overlapping Mention Recognition", Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
JERRY CHUN-WEI LIN et al.: "A Bi-LSTM Mention Hypergraph Model with Encoding Schema for Mention Extraction", Engineering Applications of Artificial Intelligence *
JINGWEI ZHUO et al.: "Segment-Level Sequence Modeling using Gated Recursive Semi-Markov Conditional Random Fields", Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics *
LEV RATINOV et al.: "Design Challenges and Misconceptions in Named Entity Recognition", Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL) *
WEI LU et al.: "Joint Mention Extraction and Classification with Mention Hypergraphs", Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing *
XU Jianzhong et al.: "Hypergraph-based discontinuous legal entity recognition", Information Technology and Informatization
JIN Guozhe et al.: "A new Korean part-of-speech tagging method", Journal of Chinese Information Processing *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427624A (en) * 2019-07-30 2019-11-08 北京百度网讯科技有限公司 Entity relation extraction method and device
CN110427624B (en) * 2019-07-30 2023-04-25 北京百度网讯科技有限公司 Entity relation extraction method and device
WO2021072848A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Text information extraction method and apparatus, and computer device and storage medium
CN112861533A (en) * 2019-11-26 2021-05-28 阿里巴巴集团控股有限公司 Entity word recognition method and device
CN113033207A (en) * 2021-04-07 2021-06-25 东北大学 Biomedical nested type entity identification method based on layer-by-layer perception mechanism
CN113033207B (en) * 2021-04-07 2023-08-29 东北大学 Biomedical nested type entity identification method based on layer-by-layer perception mechanism
WO2023035332A1 (en) * 2021-09-08 2023-03-16 深圳前海环融联易信息科技服务有限公司 Date extraction method and apparatus, computer device, and storage medium
CN113971733A (en) * 2021-10-29 2022-01-25 京东科技信息技术有限公司 Model training method, classification method and device based on hypergraph structure
WO2023109436A1 (en) * 2021-12-13 2023-06-22 广州大学 Part of speech perception-based nested named entity recognition method and system, device and storage medium


Similar Documents

Publication Publication Date Title
CN109902303A (en) A kind of entity recognition method and relevant device
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN108804677B (en) Deep learning problem classification method and system combining multi-level attention mechanism
CN107180023B (en) Text classification method and system
WO2020073664A1 (en) Anaphora resolution method and electronic device and computer-readable storage medium
CN108829801A (en) A kind of event trigger word abstracting method based on documentation level attention mechanism
CN108509413A (en) Digest extraction method, device, computer equipment and storage medium
CN109522557A (en) Training method, device and the readable storage medium storing program for executing of text Relation extraction model
CN108197109A (en) A kind of multilingual analysis method and device based on natural language processing
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
US20120253792A1 (en) Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN110334357A (en) A kind of method, apparatus, storage medium and electronic equipment for naming Entity recognition
CN106326484A (en) Error correction method and device for search terms
WO2021051574A1 (en) English text sequence labelling method and system, and computer device
CN106844350A (en) A kind of computational methods of short text semantic similarity
CN107832292A (en) A kind of conversion method based on the image of neural network model to Chinese ancient poetry
CN109446885A (en) A kind of text based Identify chip method, system, device and storage medium
CN111898374B (en) Text recognition method, device, storage medium and electronic equipment
CN107193915A (en) A kind of company information sorting technique and device
CN106970981B (en) Method for constructing relation extraction model based on transfer matrix
CN113220876B (en) Multi-label classification method and system for English text
CN109739975A (en) Focus incident abstracting method, device, readable storage medium storing program for executing and electronic equipment
CN110472062A (en) The method and device of identification name entity
CN110222329A (en) A kind of Chinese word cutting method and device based on deep learning
CN112287656B (en) Text comparison method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant