CN108182295A - Enterprise knowledge graph attribute extraction method and system - Google Patents

Enterprise knowledge graph attribute extraction method and system

Info

Publication number
CN108182295A
Authority
CN
China
Prior art keywords
attribute
event
entity
word
neural network
Prior art date
Legal status
Granted
Application number
CN201810136568.4A
Other languages
Chinese (zh)
Other versions
CN108182295B (en)
Inventor
孙世通
刘德彬
严开
陈玮
Current Assignee
China Telecom Yijin Technology Co.,Ltd.
Chongqing Yucun Technology Co ltd
Original Assignee
Chongqing Yu Yu Da Data Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Yu Yu Da Data Technology Co Ltd
Priority to CN201810136568.4A
Publication of CN108182295A
Application granted
Publication of CN108182295B
Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Abstract

The present invention provides an enterprise knowledge graph attribute extraction method comprising the following steps: defining entity categories and event categories; defining an attribute structure for each entity category; preparing and annotating a corpus; extracting entity attributes; and fusing entity attributes. The invention combines the objectivity of expert classification of domain-specific entity-attribute knowledge with the efficiency of machine-learning extraction from text, applies them to the full Chinese corpus of enterprise data, recognizes all target attribute classes with a comparatively small amount of annotation, and solves the problems of extracting node entity attributes in a knowledge graph and of fusing attributes from multiple sources.

Description

Enterprise knowledge graph attribute extraction method and system
Technical field
The present invention relates to an information processing method and system, and in particular to an enterprise knowledge graph attribute extraction method and system.
Background technology
A knowledge graph is a semantic network built on a graph data structure whose basic units are nodes and edges. In an enterprise knowledge graph, nodes represent event entities and business entities, and edges represent the relationships between entities. Focusing on a single enterprise within the full graph reveals its basic information, the development history formed by chaining its event nodes, and the clusters of enterprises associated with it at each level ("association" here includes, but is not limited to, equity investment, cooperation, upstream/downstream relationships, and subsidiaries).
Knowledge graphs are applied to enterprise-information discovery and business-risk analysis; their core value is to organically link enterprise information of every category so that risk models can uncover hidden correlated risks, identify group-level risks, and so on. In the step of structuring node data, two major problems arise: 1) extracting different attributes from different data sources, and 2) reasonably fusing attributes from different sources for the same entity.
At the technical level, building such an enterprise knowledge graph requires overcoming two difficulties:
entity attribute extraction, and multi-source attribute fusion together with establishing relationships between different entities.
The prior art uses attribute extraction and fusion based on industry experience rules and dictionaries, or based on supervised learning and pattern matching.
The drawback of rule-and-dictionary approaches is that defining industry attributes for entities in different industries requires senior domain experts, yet purely manual work cannot overcome low annotation efficiency and inconsistent annotation standards. Although a standardized dictionary can recover verb-centered word relations in text, relations such as noun appositives are easily misjudged, and the method cannot effectively process or judge out-of-vocabulary words.
The prior art also uses attribute extraction and fusion based on supervised learning and pattern matching: a classifier is built on manually annotated corpora, but its main bottleneck is that many annotations are needed and the demands on data quality are high.
Prior-art enterprise knowledge graph attribute extraction is based on text data; when images, audio/video, and text appear together, cross-source processing is restricted. Modeling also fails to account for extracting entities and relationships at different levels and granularities.
Prior-art enterprise knowledge graph attribute extraction processes target text with manual annotation, which is inefficient and costly and cannot handle massive text quickly.
Prior-art enterprise knowledge graph attribute extraction cannot perform correlation analysis and reasoning across texts, nor achieve end-to-end adaptive learning and relationship establishment.
Summary of the invention
The present invention provides a method for efficient, automatic, and accurate enterprise knowledge graph attribute extraction, comprising the following steps:
Define the entity categories, event categories, and entity attribute structures of the training samples;
Prepare and annotate the training corpus;
Train an entity attribute extraction model;
Feed target text into the entity attribute extraction model to obtain the target text's entity attributes;
Perform entity attribute fusion on the target text.
Further, defining the entity categories, event categories, and entity attribute structures of the training samples includes:
defining the entity category as enterprise or/and individual;
defining the event category as one or more of judgment document, court announcement, hearing announcement, bidding, equity, strategy, personnel, finance, debt, product, marketing, brand, and accident;
defining the attribute fields as one or more of a type field, time field, tag field, and body field.
Training corpus preparation and annotation includes annotating the event category and entity attribute structure of each text in the training sample database.
Further, training the entity attribute extraction model includes the following steps:
S1: With word-level labels, feed the N*K word-vector matrix into a first bidirectional long short-term memory (BiLSTM) recurrent neural network to obtain an N*T label-class probability distribution matrix for each word, where N is the batch size, K is the word-embedding vector length, and T is the number of word label classes; the position of the maximum value corresponds to the current word's label, and the word embedding of each word is also obtained;
S2: Determine the subject information of the training sample;
S3: Define the event vector by the following formula, where eventEmbedding is the event vector, w_j denotes the vector of the j-th word in the sentence, and n denotes that sentences within distance n of the subject are considered;
With event-level labels, feed the N*K event-vector matrix into a second BiLSTM recurrent neural network as its initial input, where N is the batch size, K is the word-embedding vector length, and L is the number of event label classes; the position of the maximum value corresponds to the current event's label.
Define the Bayesian network as:
P(A, B, C, D) = P(D|A, B) · P(C|A) · P(B|A) · P(A)
where A is the probability that the text describes a certain class of event,
B is the probability that event extraction succeeds,
C is the probability that the text contains time information,
D is the probability that the text contains domain-specific vocabulary.
The value of B is determined by whether the label output by the N*L label-class probability distribution matrix matches the training-sample annotation: B is assigned 1 if they match and 0 otherwise.
Obtain a first N*L matrix from the second BiLSTM network and feed it into the Bayesian network; perform feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and feed the fusion result back to the second BiLSTM network;
S4: Define the loss function as the mean square error between the output of each time step of the BiLSTM network and the training-sample annotations, and repeat step S3 until the loss function converges.
Further, the entity attribute extraction model includes either of the following:
the first N*L matrix is obtained from the forward hidden layer of the second BiLSTM recurrent neural network and fed into the Bayesian network; feature fusion is performed on the second N*L matrix output by the Bayesian network and the first N*L matrix; and the fusion result serves as input to the backward hidden layer of the second BiLSTM network;
or,
the first N*L matrix is obtained from the output layer of the second BiLSTM recurrent neural network and fed into the Bayesian network; feature fusion is performed on the second N*L matrix output by the Bayesian network and the first N*L matrix; and the fusion result serves as input to the input layer of the second BiLSTM network.
Further, performing entity attribute fusion on the target text includes the following steps:
A. Select the base structure of an event entity's data as the base value according to its similarity to the standard template;
B. Traverse the candidate-set events, matching attributes pairwise in tree depth-first order;
C. When two events are compared, follow these rules:
if a node attribute value in the base structure is missing, supplement it directly;
if corresponding node attribute values in the base structure conflict, replace the base's non-null value when the quality evaluation function scores the candidate set's attribute value higher;
if the base attribute is in list format, append the candidate set's elements that are not already in the base's list;
D. Repeat steps B and C until the attributes can no longer be improved.
To implement the above method, the present invention further provides an enterprise knowledge graph attribute extraction system, comprising the following units:
a definition unit for defining the entity categories, event categories, and entity attribute structures of the training samples;
an annotation unit for preparing and annotating the training corpus;
a training unit for training the entity attribute extraction model;
an entity attribute extraction unit for feeding target text into the entity attribute extraction model to obtain the target text's entity attributes;
an attribute fusion unit for performing entity attribute fusion on the target text.
Further, the definition unit defining the entity categories, event categories, and entity attribute structures of the training samples includes:
defining the entity category as enterprise or/and individual;
defining the event category as one or more of judgment document, court announcement, hearing announcement, bidding, equity, strategy, personnel, finance, debt, product, marketing, brand, and accident;
defining the attribute fields as one or more of a type field, time field, tag field, and body field.
The training corpus preparation and annotation includes annotating the event category and entity attribute structure of each text in the training sample database.
Further, the training unit trains the entity attribute extraction model using the following steps:
S1: With word-level labels, feed the N*K word-vector matrix into a first bidirectional long short-term memory (BiLSTM) recurrent neural network to obtain an N*T label-class probability distribution matrix for each word, where N is the batch size, K is the word-embedding vector length, and T is the number of word label classes; the position of the maximum value corresponds to the current word's label, and the word embedding of each word is also obtained;
S2: Determine the subject information of the training sample;
S3: Define the event vector by the following formula, where eventEmbedding is the event vector, w_j denotes the vector of the j-th word in the sentence, and n denotes that sentences within distance n of the subject are considered;
With event-level labels, feed the N*K event-vector matrix into a second BiLSTM recurrent neural network as its initial input, where N is the batch size, K is the word-embedding vector length, and L is the number of event label classes; the position of the maximum value corresponds to the current event's label.
Define the Bayesian network as:
P(A, B, C, D) = P(D|A, B) · P(C|A) · P(B|A) · P(A)
where A is the probability that the text describes a certain class of event,
B is the probability that event extraction succeeds,
C is the probability that the text contains time information,
D is the probability that the text contains domain-specific vocabulary.
The value of B is determined by whether the label output by the N*L label-class probability distribution matrix matches the training-sample annotation: B is assigned 1 if they match and 0 otherwise.
Obtain a first N*L matrix from the second BiLSTM network and feed it into the Bayesian network; perform feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and feed the fusion result back to the second BiLSTM network;
S4: Define the loss function as the mean square error between the output of each time step of the BiLSTM network and the training-sample annotations, and repeat step S3 until the loss function converges.
Further, the entity attribute extraction model includes either of the following:
the first N*L matrix is obtained from the forward hidden layer of the second BiLSTM recurrent neural network and fed into the Bayesian network; feature fusion is performed on the second N*L matrix output by the Bayesian network and the first N*L matrix; and the fusion result serves as input to the backward hidden layer of the second BiLSTM network;
or,
the first N*L matrix is obtained from the output layer of the second BiLSTM recurrent neural network and fed into the Bayesian network; feature fusion is performed on the second N*L matrix output by the Bayesian network and the first N*L matrix; and the fusion result serves as input to the input layer of the second BiLSTM network.
Further, the attribute fusion unit performs entity attribute fusion on the target text using the following steps:
A. Select the base structure of an event entity's data as the base value according to its similarity to the standard template;
B. Traverse the candidate-set events, matching attributes pairwise in tree depth-first order;
C. When two events are compared, follow these rules:
if a node attribute value in the base structure is missing, supplement it directly;
if corresponding node attribute values in the base structure conflict, replace the base's non-null value when the quality evaluation function scores the candidate set's attribute value higher;
if the base attribute is in list format, append the candidate set's elements that are not already in the base's list;
D. Repeat steps B and C until the attributes can no longer be improved.
The beneficial effects of the invention are:
1. Knowledge acquisition from multi-source heterogeneous data is achieved while the algorithm model's dependence on labels is reduced.
2. Entity attribute extraction, multi-source attribute fusion, and the establishment of relationships between different entities are achieved.
3. The objectivity of expert classification of domain-specific entity-attribute knowledge is combined with the efficiency of machine-learning extraction from text and applied to the full Chinese corpus of enterprise data; all target attribute classes are recognized with a comparatively small amount of annotation.
4. After the attribute extraction model is trained on sample data, entity attribute extraction and knowledge graph construction over massive target text data are automated, improving efficiency and reducing labor cost.
5. The invention combines the strengths of Bayesian networks and LSTMs into a Bayesian recurrent neural network, in which the Bayesian network feeds back into the BiLSTM recurrent network: the BiLSTM laterally captures long-time, long-range temporal correlations between entities, while the Bayesian network performs correlation analysis and reasoning longitudinally. Meanwhile, the Bayesian network's inference results are fed back to update the BiLSTM, achieving end-to-end adaptive learning and relationship establishment.
Description of the drawings
Fig. 1 is a flowchart of the enterprise knowledge graph attribute extraction method of one embodiment of the invention.
Fig. 2 is a structural diagram of the enterprise knowledge graph attribute extraction system of one embodiment of the invention.
Fig. 3 is a schematic diagram of a prior-art long short-term memory network.
Fig. 4 is a schematic diagram of a prior-art BiLSTM neural network model.
Fig. 5 is a schematic diagram of the Bayesian recurrent neural network model of one embodiment of the invention.
Fig. 6 is a schematic diagram of the Bayesian network of one embodiment of the invention.
Fig. 7 is a schematic diagram of a prior-art LSTM memory cell.
Fig. 8 is a feature-fusion schematic diagram of one embodiment of the invention.
Fig. 9 is a feature-fusion schematic diagram of one embodiment of the invention.
Specific embodiments
One idea by which the present invention solves the problems described in the background is to use a Bayesian recurrent neural network as the entity attribute extraction model for enterprise knowledge graph attribute extraction. The Bayesian network is stacked as a network layer on top of the BiLSTM recurrent neural network, so that the BiLSTM laterally captures long-time, long-range temporal correlations between entities while the Bayesian network performs correlation analysis and reasoning longitudinally. Meanwhile, the Bayesian network's inference results are fed back to update the BiLSTM, achieving end-to-end adaptive learning and relationship establishment. This builds a precise and efficient entity attribute extraction model and automates entity attribute extraction.
As shown in Fig. 1, the enterprise knowledge graph attribute extraction method of the present invention includes the following steps:
Define the entity categories, event categories, and entity attribute structures of the training samples;
Prepare and annotate the training corpus;
Train the entity attribute extraction model;
Feed target text into the entity attribute extraction model to obtain the target text's entity attributes;
Perform entity attribute fusion on the target text.
In the step of defining entity categories and event categories:
The entity category may be enterprise or individual.
The event category may be judgment document, court announcement, hearing announcement, bidding, equity, strategy, personnel, finance, debt, product, marketing, brand, accident, etc.
For each entity category, a standardized attribute structure is defined. Taking the accident class as an example, the event attribute structure defined in an embodiment of the present invention is:
Taking equity as an example, the event attribute structure defined in an embodiment of the present invention is:
In the corpus preparation and annotation step, the word annotation specification and meanings in an embodiment of the present invention are as follows:
B-ORG marks the start of an entity;
I-ORG marks the continuation of an entity;
X marks placeholders such as punctuation;
O marks other words.
Once corpus annotation is complete, downstream programs can understand the meaning of the entities in the text, which facilitates machine processing.
In an embodiment of the present invention, every word of the training text is annotated according to the above specification.
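A minimal Python sketch of this annotation scheme follows; the sample sentence, labels, and helper function are illustrative assumptions, not part of the patent.

    # One (character, label) pair per token; B-ORG opens an entity span,
    # I-ORG continues it, X marks punctuation placeholders, O everything else.
    tagged = [
        ("重", "B-ORG"), ("庆", "I-ORG"), ("誉", "I-ORG"), ("存", "I-ORG"),
        ("发", "O"), ("生", "O"), ("事", "O"), ("故", "O"), ("。", "X"),
    ]

    def extract_entities(pairs):
        """Recover entity strings from a B/I/X/O tag sequence."""
        entities, current = [], []
        for token, label in pairs:
            if label == "B-ORG":
                if current:
                    entities.append("".join(current))
                current = [token]
            elif label == "I-ORG" and current:
                current.append(token)
            else:
                if current:
                    entities.append("".join(current))
                    current = []
        if current:
            entities.append("".join(current))
        return entities

    print(extract_entities(tagged))  # ['重庆誉存']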
The event label specification and meanings in an embodiment of the present invention are as follows:
JUDGE denotes judgment document;
NOTICE denotes court announcement;
COURT denotes hearing announcement;
BIDDING denotes bidding;
STOCK denotes equity;
STRATEGY denotes strategy;
HR denotes personnel;
FINANCE denotes finance;
DEBET denotes debt;
PROD denotes product;
MARKET denotes marketing;
BRAND denotes brand;
ACCIDENT denotes accident.
Note that the event labels and their specification can be chosen flexibly for a specific project and are not limited to those enumerated above.
Event labels use English identifiers to facilitate downstream processing of the text.
Following the above specification, every text in the training set is labeled.
In an embodiment of the present invention the training text is annotated manually, and the annotation results serve as the benchmark for model training in subsequent steps.
The entity attribute extraction model training step is illustrated with reference to an embodiment.
Given the series of problems that current mainstream methods have in entity attribute extraction (described in the background), the present invention addresses these difficulties with deep neural networks. For the enterprise-centered event-entity attribute extraction problem, it applies end-to-end semi-supervised and unsupervised methods, thereby acquiring knowledge from multi-source heterogeneous data and reducing the algorithm model's dependence on labels.
A long short-term memory network (LSTM) is a special recurrent neural network for learning long-term dependencies in time-series data. Since its introduction it has been widely applied to handwriting recognition, speech recognition, machine translation, and many other fields, with notable results. It can retain information over long spans and is markedly effective in text semantic analysis. Unfolded along the time dimension, it yields a chained LSTM network that can model relationships between entities of indeterminate length and thereby characterize their respective features. The LSTM memory cell is shown in Fig. 7.
The LSTM cell can be characterized by the following formulas:
i_t = g(W_xi · x_t + W_hi · h_(t-1) + b_i)
f_t = g(W_xf · x_t + W_hf · h_(t-1) + b_f)
o_t = g(W_xo · x_t + W_ho · h_(t-1) + b_o)
The input transformation can be characterized by:
c_in_t = tanh(W_xc · x_t + W_hc · h_(t-1) + b_c_in)
The state update can be characterized by:
c_t = f_t · c_(t-1) + i_t · c_in_t
h_t = o_t · tanh(c_t)
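For concreteness, a minimal numpy sketch of one LSTM time step matching the cell equations above follows; the weight shapes, the sigmoid choice for g, and the toy dimensions are standard assumptions rather than the patent's specification.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM time step; W holds the W_x* and W_h* matrices per gate."""
        i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])   # input gate
        f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])   # forget gate
        o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])   # output gate
        c_in = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # input transform
        c_t = f_t * c_prev + i_t * c_in                            # state update
        h_t = o_t * np.tanh(c_t)                                   # hidden output
        return h_t, c_t

    K, H = 4, 3  # toy sizes: K-dim embedding in, H-dim hidden state out
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(H, K if k.startswith("x") else H))
         for k in ["xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc"]}
    b = {k: np.zeros(H) for k in ["i", "f", "o", "c"]}
    h, c = lstm_step(rng.normal(size=K), np.zeros(H), np.zeros(H), W, b)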
A bidirectional LSTM (BiLSTM) comprises two groups of hidden layers, forward and backward. It captures long-range, long-time contextual dependencies, extracts contextual entity features, obtains more temporal correlations between entities, and suppresses noise such as interfering entities from both directions, greatly aiding the mining of long-term dependencies and the extraction of the high-level semantic features vital to information extraction and entity-relationship recognition. Compared with Bayesian networks, the advantage of the LSTM and its variants is capturing long sequential relations between entities, but their reasoning ability and interpretability are poor. The BiLSTM neural network model is shown in Fig. 4.
A Bayesian network (BN), also called a belief network, is a probabilistic graphical model. It simulates causal uncertainty in human reasoning, enabling relationship establishment and inference, with good knowledge-representation ability and handling of uncertain knowledge. Bayesian networks can encode and explain knowledge from a probabilistic perspective and are widely applied in computer intelligence, medical diagnosis, information retrieval, and many other fields. Their strength is powerful inference; their weakness is poor modeling of long sequences, so they cannot capture indirect relations between entities well.
The present invention combines the strengths of the Bayesian network and the BiLSTM into a Bayesian recurrent neural network. The Bayesian network is stacked as a network layer on top of the BiLSTM recurrent network, so that the BiLSTM laterally captures long-time, long-range temporal correlations between entities while the Bayesian network performs correlation analysis and reasoning longitudinally. Meanwhile, the Bayesian network's inference results are fed back to update the BiLSTM, achieving end-to-end adaptive learning and relationship establishment.
The Bayesian recurrent neural network model of one embodiment of the invention is shown in Fig. 5.
One embodiment of the invention trains the entity attribute extraction model using the following steps:
S1: With word-level labels, the word-vector matrix (N*K) is fed into the BiLSTM to obtain each word's label-class probability distribution (an N*4 matrix), where N is the batch length, K is the embedding vector length, and 4 is the number of word-label classes; the position of the maximum value corresponds to the current word's label. The word embedding of each word is also obtained at this point.
An embedding can be regarded mathematically as a mapping between spaces that is injective and structure-preserving (an injective function maps distinct arguments to distinct values; more precisely, f is injective when each y in its codomain has at most one x in its domain with f(x) = y). In the word-embedding setting this means finding a function that generates a new representation: the information expressed by a word's one-hot vector in space X is mapped to a dense multi-dimensional vector in space Y.
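As a sketch of this mapping, the embedding lookup below is equivalent to multiplying a one-hot vector by an embedding table; the vocabulary and dimensions are toy assumptions.

    import numpy as np

    vocab = {"企业": 0, "事故": 1, "股权": 2}
    K = 5  # embedding vector length
    E = np.random.default_rng(1).normal(size=(len(vocab), K))  # embedding table

    def embed(word):
        """Row lookup, equivalent to one_hot(word) @ E."""
        return E[vocab[word]]

    word_vectors = np.stack([embed(w) for w in ["企业", "事故"]])  # shape (2, K)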
Batch size: in an embodiment of the present invention there are three parameter-update strategies:
(1) Batch gradient descent: traverse the entire dataset, compute the loss function once, and perform one parameter update; the resulting direction points more accurately toward the extremum.
(2) Stochastic gradient descent: compute the loss and update the parameters once per sample; its advantage is speed.
(3) Mini-batch gradient descent: a compromise between the previous two; the sample data are divided into several batches, and the loss is computed and the parameters updated batch by batch, giving a more stable direction. A sketch is given below.
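A minimal mini-batch gradient-descent sketch follows; the linear least-squares model, learning rate, and batch size are toy assumptions used only to show the update pattern.

    import numpy as np

    rng = np.random.default_rng(4)
    X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
    w = np.zeros(3)
    lr, batch_size = 0.1, 16

    for epoch in range(10):
        idx = rng.permutation(len(X))              # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            grad = 2 * X[batch].T @ residual / len(batch)
            w -= lr * grad                         # one update per mini-batch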
S2: According to the sequence-labeling result, obtain the candidate subjects of the event from the text, and determine the subject by syntactic and part-of-speech analysis (dependency parsing, which is common knowledge to those skilled in the art and is not expanded here);
S3: The event vector is defined by the following formula, where eventEmbedding is the event vector, w_j denotes the vector of the j-th word in the sentence, and n denotes that the words of sentences within distance n of the subject are considered;
Through the above steps, the label-class probability distribution of each word and the event-vector matrix of the text can be obtained from the training text or the target text.
With event-level labels, the event-vector matrix (N*K) is fed into the BiLSTM to obtain the label-class probability distribution of each event in the training sample (an N*L matrix), where N is the batch length, K is the embedding vector length, and L is the number of event-label classes (not repeated below); the position of the maximum value corresponds to the current event's label.
The position of the maximum value corresponds to the current event's label; that is, the event with the highest probability in the distribution is taken as the entity attribute extraction result.
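The following sketch shows one way to build an event vector from the word vectors; the patent's formula appears only as an image in the original filing, so the mean over the words within distance n of the subject used here is an assumption consistent with the symbol definitions above, not the patent's exact formula.

    import numpy as np

    def event_embedding(word_vectors, subject_idx, n):
        """Pool the vectors w_j of words within distance n of the subject."""
        lo, hi = max(0, subject_idx - n), subject_idx + n + 1
        return word_vectors[lo:hi].mean(axis=0)

    word_vectors = np.random.default_rng(2).normal(size=(10, 8))  # toy sentence
    ev = event_embedding(word_vectors, subject_idx=4, n=3)        # shape (8,)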
In an embodiment of the present invention, event-level labeling refers to the set of texts annotated with the same event type in the training sample.
In an embodiment of the present invention, the Bayesian network is defined according to the actual dependencies, as shown in Fig. 6; the DAG (directed acyclic graph) of the joint probability that a text describes a certain class of event is:
P(A, B, C, D) = P(D|A, B) · P(C|A) · P(B|A) · P(A)
where A is the probability that the text describes a certain class of event,
B is the probability that event extraction succeeds,
C is the probability that the text contains time information,
D is the probability that the text contains domain-specific vocabulary.
For B (the probability that extraction succeeds): over all events in the corpus, the computed label is compared with the training-sample annotation; B is assigned 1 if they match and 0 otherwise.
If the event label output by the second BiLSTM recurrent neural network is identical to the manually annotated label, event extraction succeeded; otherwise event extraction failed.
In an embodiment of the present invention, a training sample is fed into the BiLSTM to obtain its event-category distribution; if the accident class has the highest probability, the sample is classified as an accident event. If the sample is annotated as an accident, event extraction succeeded and B = 1; if it is not annotated as an accident, event extraction failed and B = 0.
In an embodiment of the present invention, the probability that an accident event contains domain-specific vocabulary is the number of manually annotated accident samples containing domain-specific vocabulary divided by the total number of samples manually annotated as accidents.
In an embodiment of the present invention, the probability that an accident event contains time information is the number of manually annotated accident samples containing time information divided by the total number of samples manually annotated as accidents.
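These conditional probabilities are simple corpus ratios; the sketch below estimates them and evaluates the joint-probability factorization, with hypothetical counts and prior values.

    n_accident = 200           # samples manually annotated "accident"
    n_with_domain_vocab = 150  # of those, samples containing domain vocabulary
    n_with_time_info = 180     # of those, samples containing time information

    p_A = 0.3                          # assumed prior: text describes an accident
    p_B_given_A = 0.9                  # assumed extraction-success rate
    p_C_given_A = n_with_time_info / n_accident       # P(C | A)
    p_D_given_AB = n_with_domain_vocab / n_accident   # simplified P(D | A, B)

    # Joint probability per P(A, B, C, D) = P(D|A,B) * P(C|A) * P(B|A) * P(A)
    p_joint = p_D_given_AB * p_C_given_A * p_B_given_A * p_A
    print(p_joint)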
The matrix output by the Bayesian network is the probability distribution matrix of whether the text describes a certain event.
A first N*L matrix is obtained from the second BiLSTM recurrent neural network and fed into the Bayesian network; feature fusion is performed on the second N*L matrix output by the Bayesian network and the first N*L matrix, and the fusion result is fed back to the second BiLSTM network.
Specifically, the above process can include two embodiments.
First embodiment: the first N*L matrix is obtained from the forward hidden layer of the second BiLSTM recurrent neural network and fed into the Bayesian network; feature fusion is performed on the second N*L matrix output by the Bayesian network and the first N*L matrix; and the fusion result serves as input to the backward hidden layer of the second BiLSTM network.
Specifically, in the first embodiment, as shown in Fig. 8, the first N*L matrix is obtained from the forward hidden layer at time step t and fed into the Bayesian network; the second N*L matrix output by the Bayesian network is fused with the first N*L matrix; and the fusion result serves as the input to the backward hidden layer at time step t.
Those skilled in the art will understand that time step t refers to position t of the input sequence; a recurrent neural network has one input x_t at each time step.
In other embodiments, the first N*L matrix is obtained from the forward hidden layer at time step t1 and fed into the Bayesian network; the second N*L matrix output by the Bayesian network is fused with the first N*L matrix; and the fusion result serves as the input to the backward hidden layer at time step t2, where t1 and t2 are different positions of the input sequence.
Second embodiment: as shown in Fig. 9, the first N*L matrix is obtained from the output layer of the second BiLSTM recurrent neural network and fed into the Bayesian network; feature fusion is performed on the second N*L matrix output by the Bayesian network and the first N*L matrix; and the fusion result serves as input to the input layer of the second BiLSTM network.
In the present invention, the Bayesian network is stacked as a network layer on top of the BiLSTM recurrent neural network, so that the BiLSTM laterally captures long-time, long-range temporal correlations between entities while the Bayesian network performs correlation analysis and reasoning longitudinally. Meanwhile, the Bayesian network's inference results are fed back to update the BiLSTM, achieving end-to-end adaptive learning and relationship establishment.
It should be noted that taking the arithmetic mean of the BiLSTM output matrix and the Bayesian network output matrix is only one mode of matrix feature fusion, and the invention is not limited to it; feature fusion may also use the geometric mean, the root mean square, the harmonic mean, a weighted mean, and so on, as sketched below.
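A short numpy sketch of these fusion modes over two toy N*L matrices follows; the 0.7/0.3 weights in the weighted mean are assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    bilstm_out = rng.uniform(0.1, 1.0, size=(2, 5))  # first N*L matrix
    bayes_out = rng.uniform(0.1, 1.0, size=(2, 5))   # second N*L matrix

    fused_arithmetic = (bilstm_out + bayes_out) / 2
    fused_geometric = np.sqrt(bilstm_out * bayes_out)
    fused_rms = np.sqrt((bilstm_out ** 2 + bayes_out ** 2) / 2)
    fused_harmonic = 2 / (1 / bilstm_out + 1 / bayes_out)
    fused_weighted = 0.7 * bilstm_out + 0.3 * bayes_out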
S4: The loss function is defined as the mean square error between the output of each BiLSTM time step and the labels; the model is iterated, repeating step S3, until the loss function converges.
The entity attribute fusion step on the target text is illustrated with reference to an embodiment.
Feeding target text into the entity attribute extraction model yields the target text's entity attributes: the subject and attribute structure of every target text, together with the distribution over the event categories to which the target text belongs:
Distribution = [p1, p2, ..., pL]
However, events obtained from different data sources may describe the same event while their extraction results have missing or conflicting attributes. The present invention therefore introduces a fusion strategy that solves this problem on top of event extraction.
The present invention defines the category similarity of two events as the similarity of their event distributions (cosine similarity, etc.). When many events are extracted, traversing pairwise similarities incurs a large computational cost; therefore an event candidate set is built first, and the set of events to fuse is chosen from the candidate set.
The basic rules for choosing the candidate set are as follows (a sketch follows this list):
the event subjects are identical;
the similarity of the event-category distributions is high (cosine similarity);
the event times are close.
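A sketch of these candidate rules follows; the similarity threshold and the time window are assumptions.

    import numpy as np

    def cosine_similarity(p, q):
        return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

    def is_fusion_candidate(e1, e2, sim_threshold=0.9, max_day_gap=3):
        same_subject = e1["subject"] == e2["subject"]
        similar = cosine_similarity(e1["dist"], e2["dist"]) >= sim_threshold
        close_in_time = abs(e1["day"] - e2["day"]) <= max_day_gap
        return same_subject and similar and close_in_time

    e1 = {"subject": "某公司", "dist": np.array([0.8, 0.1, 0.1]), "day": 100}
    e2 = {"subject": "某公司", "dist": np.array([0.75, 0.15, 0.1]), "day": 101}
    print(is_fusion_candidate(e1, e2))  # True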
For the event candidate set, the attributes must also be mutually fused; this step relies on the matching degree of attributes such as time, subject, and category to align the entities of similar events. The attribute fusion steps are as follows (a sketch follows this list):
A. Select the base structure of an event entity's data as the base value according to its similarity to the standard template.
B. Traverse the candidate-set events, matching attributes pairwise in tree depth-first order.
C. When two events are compared, follow these rules:
if a node attribute value in the base structure is missing, supplement it directly;
if corresponding node attribute values in the base structure conflict, replace the base's non-null value when the quality evaluation function scores the candidate set's attribute value higher;
if the base attribute is in list format, append the candidate set's elements that are not already in the base's list.
D. Repeat steps B and C until the attributes can no longer be improved.
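The sketch below applies rules A to C to flat attribute dictionaries; quality() is a hypothetical stand-in for the patent's quality evaluation function, and the sample events mirror the conflict example given later.

    def quality(value):
        """Hypothetical scorer; here, prefer longer (more specific) values."""
        return len(str(value))

    def fuse(base, candidate):
        for key, cand_val in candidate.items():
            base_val = base.get(key)
            if base_val is None:                 # missing value: supplement
                base[key] = cand_val
            elif isinstance(base_val, list):     # list attribute: merge uniques
                base[key] = base_val + [v for v in cand_val if v not in base_val]
            elif base_val != cand_val and quality(cand_val) > quality(base_val):
                base[key] = cand_val             # conflict: keep better value
        return base

    event4 = {"subject": "某公司", "time": "2017-05-08", "tags": ["事故"]}
    event5 = {"subject": "某公司", "time": "2017-05-08 00:00:00", "tags": ["火灾"]}
    print(fuse(event4, event5))  # time replaced; tags merged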
In an embodiment of the present invention, two events are extracted from two target texts.
The standard template in this embodiment is:
The base structure in this embodiment is:
The attribute values in this embodiment are eventType, tags, subject, and time;
In another embodiment of the present invention, two events are obtained by passing multiple target texts through the attribute extraction model:
Event 1:
Event 2:
Since the two events above have the same subject and the same time, the two events have the same structure template; fusing event 1 and event 2 yields event 3.
In another embodiment of the present invention, two events are obtained by passing multiple target texts through the attribute extraction model:
Event 4:
Event 5:
In this embodiment the two events have the same underlying structure, but their time attributes conflict; the quality evaluation function scores event 5's time attribute value higher, so event 4's time attribute is replaced with time: 2017-05-08 00:00:00.
Finally, it should be noted that the above embodiments merely illustrate, rather than limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not depart the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present invention, and should all be covered by the scope of the claims and specification of the present invention.

Claims (10)

  1. An enterprise knowledge graph attribute extraction method, characterized by comprising the following steps:
    defining the entity categories, event categories, and entity attribute structures of the training samples;
    preparing and annotating the training corpus;
    training an entity attribute extraction model;
    feeding target text into the entity attribute extraction model to obtain the target text's entity attributes;
    performing entity attribute fusion on the target text.
  2. The enterprise knowledge graph attribute extraction method according to claim 1, characterized in that
    defining the entity categories, event categories, and entity attribute structures of the training samples includes:
    defining the entity category as enterprise or/and individual;
    defining the event category as one or more of judgment document, court announcement, hearing announcement, bidding, equity, strategy, personnel, finance, debt, product, marketing, brand, and accident;
    defining the attribute fields as one or more of a type field, time field, tag field, and body field;
    and training corpus preparation and annotation includes annotating the event category and entity attribute structure of each text in the training sample database.
  3. The enterprise knowledge graph attribute extraction method according to claim 1, characterized in that
    training the entity attribute extraction model includes the following steps:
    S1: with word-level labels, feeding the N*K word-vector matrix into a first bidirectional long short-term memory (BiLSTM) recurrent neural network to obtain an N*T label-class probability distribution matrix for each word, where N is the batch size, K is the word-embedding vector length, and T is the number of word label classes; the position of the maximum value corresponds to the current word's label, and the word embedding of each word is also obtained;
    S2: determining the subject information of the training sample;
    S3: defining the event vector by the following formula, where eventEmbedding is the event vector, w_j denotes the vector of the j-th word in the sentence, and n denotes that sentences within distance n of the subject are considered;
    with event-level labels, feeding the N*K event-vector matrix into a second BiLSTM recurrent neural network as its initial input, where N is the batch size, K is the word-embedding vector length, and L is the number of event label classes; the position of the maximum value corresponds to the current event's label;
    defining the Bayesian network as:
    P(A, B, C, D) = P(D|A, B) · P(C|A) · P(B|A) · P(A),
    where A is the probability that the text describes a certain class of event,
    B is the probability that event extraction succeeds,
    C is the probability that the text contains time information,
    D is the probability that the text contains domain-specific vocabulary,
    and the value of B is determined by whether the label output by the N*L label-class probability distribution matrix matches the training-sample annotation: B is assigned 1 if they match and 0 otherwise;
    obtaining a first N*L matrix from the second BiLSTM network, feeding it into the Bayesian network, performing feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and feeding the fusion result back to the second BiLSTM network;
    S4: defining the loss function as the mean square error between the output of each time step of the BiLSTM network and the training-sample annotations, and repeating step S3 until the loss function converges.
  4. The enterprise knowledge graph attribute extraction method according to any one of claims 1 to 3, characterized in that
    the entity attribute extraction model includes either:
    obtaining the first N*L matrix from the forward hidden layer of the second BiLSTM recurrent neural network, feeding it into the Bayesian network, performing feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and using the fusion result as input to the backward hidden layer of the second BiLSTM network;
    or,
    obtaining the first N*L matrix from the output layer of the second BiLSTM recurrent neural network, feeding it into the Bayesian network, performing feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and using the fusion result as input to the input layer of the second BiLSTM network.
  5. The enterprise knowledge graph attribute extraction method according to claim 1, characterized in that
    performing entity attribute fusion on the target text includes the following steps:
    A. selecting the base structure of an event entity's data as the base value according to its similarity to the standard template;
    B. traversing the candidate-set events and matching attributes pairwise in tree depth-first order;
    C. when two events are compared, following these rules:
    if a node attribute value in the base structure is missing, supplementing it directly;
    if corresponding node attribute values in the base structure conflict, replacing the base's non-null value when the quality evaluation function scores the candidate set's attribute value higher;
    if the base attribute is in list format, appending the candidate set's elements that are not already in the base's list;
    D. repeating steps B and C until the attributes can no longer be improved.
  6. An enterprise knowledge graph attribute extraction system, characterized by comprising the following units:
    a definition unit for defining the entity categories, event categories, and entity attribute structures of the training samples;
    an annotation unit for preparing and annotating the training corpus;
    a training unit for training the entity attribute extraction model;
    an entity attribute extraction unit for feeding target text into the entity attribute extraction model to obtain the target text's entity attributes;
    an attribute fusion unit for performing entity attribute fusion on the target text.
  7. The enterprise knowledge graph attribute extraction system according to claim 6, characterized in that
    the definition unit defining the entity categories, event categories, and entity attribute structures of the training samples includes:
    defining the entity category as enterprise or/and individual;
    defining the event category as one or more of judgment document, court announcement, hearing announcement, bidding, equity, strategy, personnel, finance, debt, product, marketing, brand, and accident;
    defining the attribute fields as one or more of a type field, time field, tag field, and body field;
    and the training corpus preparation and annotation includes annotating the event category and entity attribute structure of each text in the training sample database.
  8. The enterprise knowledge graph attribute extraction system according to claim 6, characterized in that
    the training unit trains the entity attribute extraction model using the following steps:
    S1: with word-level labels, feeding the N*K word-vector matrix into a first bidirectional long short-term memory (BiLSTM) recurrent neural network to obtain an N*T label-class probability distribution matrix for each word, where N is the batch size, K is the word-embedding vector length, and T is the number of word label classes; the position of the maximum value corresponds to the current word's label, and the word embedding of each word is also obtained;
    S2: determining the subject information of the training sample;
    S3: defining the event vector by the following formula, where eventEmbedding is the event vector, w_j denotes the vector of the j-th word in the sentence, and n denotes that sentences within distance n of the subject are considered;
    with event-level labels, feeding the N*K event-vector matrix into a second BiLSTM recurrent neural network as its initial input, where N is the batch size, K is the word-embedding vector length, and L is the number of event label classes; the position of the maximum value corresponds to the current event's label;
    defining the Bayesian network as:
    P(A, B, C, D) = P(D|A, B) · P(C|A) · P(B|A) · P(A),
    where A is the probability that the text describes a certain class of event,
    B is the probability that event extraction succeeds,
    C is the probability that the text contains time information,
    D is the probability that the text contains domain-specific vocabulary,
    and the value of B is determined by whether the label output by the N*L label-class probability distribution matrix matches the training-sample annotation: B is assigned 1 if they match and 0 otherwise;
    obtaining a first N*L matrix from the second BiLSTM network, feeding it into the Bayesian network, performing feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and feeding the fusion result back to the second BiLSTM network;
    S4: defining the loss function as the mean square error between the output of each time step of the BiLSTM network and the training-sample annotations, and repeating step S3 until the loss function converges.
  9. The enterprise knowledge graph attribute extraction system according to any one of claims 6 to 8, characterized in that
    the entity attribute extraction model includes either:
    obtaining the first N*L matrix from the forward hidden layer of the second BiLSTM recurrent neural network, feeding it into the Bayesian network, performing feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and using the fusion result as input to the backward hidden layer of the second BiLSTM network;
    or,
    obtaining the first N*L matrix from the output layer of the second BiLSTM recurrent neural network, feeding it into the Bayesian network, performing feature fusion on the second N*L matrix output by the Bayesian network and the first N*L matrix, and using the fusion result as input to the input layer of the second BiLSTM network.
  10. The enterprise knowledge graph attribute extraction system according to claim 6, characterized in that
    the attribute fusion unit performs entity attribute fusion on the target text using the following steps:
    A. selecting the base structure of an event entity's data as the base value according to its similarity to the standard template;
    B. traversing the candidate-set events and matching attributes pairwise in tree depth-first order;
    C. when two events are compared, following these rules:
    if a node attribute value in the base structure is missing, supplementing it directly;
    if corresponding node attribute values in the base structure conflict, replacing the base's non-null value when the quality evaluation function scores the candidate set's attribute value higher;
    if the base attribute is in list format, appending the candidate set's elements that are not already in the base's list;
    D. repeating steps B and C until the attributes can no longer be improved.
CN201810136568.4A 2018-02-09 2018-02-09 Enterprise knowledge graph attribute extraction method and system Active CN108182295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810136568.4A CN108182295B (en) 2018-02-09 2018-02-09 Enterprise knowledge graph attribute extraction method and system


Publications (2)

Publication Number Publication Date
CN108182295A true CN108182295A (en) 2018-06-19
CN108182295B CN108182295B (en) 2021-09-10

Family

ID=62552761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810136568.4A Active CN108182295B (en) 2018-02-09 2018-02-09 Enterprise knowledge graph attribute extraction method and system

Country Status (1)

Country Link
CN (1) CN108182295B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440287A (en) * 2013-08-14 2013-12-11 广东工业大学 Web question-answering retrieval system based on product information structuring
CN105335378A (en) * 2014-06-25 2016-02-17 富士通株式会社 Multi-data source information processing device and method, and server
WO2017185887A1 (en) * 2016-04-29 2017-11-02 Boe Technology Group Co., Ltd. Apparatus and method for analyzing natural language medical text and generating medical knowledge graph representing natural language medical text
CN106250412A (en) * 2016-07-22 2016-12-21 浙江大学 Knowledge graph construction method based on multi-source entity fusion
CN106528528A (en) * 2016-10-18 2017-03-22 哈尔滨工业大学深圳研究生院 Text sentiment analysis method and device
CN107220237A (en) * 2017-05-24 2017-09-29 南京大学 Business entity relation extraction method based on convolutional neural networks
CN107665252A (en) * 2017-09-27 2018-02-06 深圳证券信息有限公司 Knowledge graph creation method and device
CN107633093A (en) * 2017-10-10 2018-01-26 南通大学 Construction and query method for a power supply decision knowledge graph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
TAO CHEN et al.: "Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN", Expert Systems with Applications *
ZENG Daojian et al.: "Open entity attribute extraction oriented to unstructured text", Journal of Jiangxi Normal University (Natural Science Edition) *
YUAN Kaiqi et al.: "Medical knowledge graph construction techniques and research progress", Application Research of Computers *
JIA Zhen et al.: "Attribute and attribute value extraction for Chinese online encyclopedias", Acta Scientiarum Naturalium Universitatis Pekinensis *

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920556A (en) * 2018-06-20 2018-11-30 华东师范大学 Recommendation expert method based on subject knowledge map
CN108920556B (en) * 2018-06-20 2021-11-19 华东师范大学 Expert recommending method based on discipline knowledge graph
CN108920656A (en) * 2018-07-03 2018-11-30 龙马智芯(珠海横琴)科技有限公司 Document properties description content extracting method and device
CN110019841A (en) * 2018-07-24 2019-07-16 南京涌亿思信息技术有限公司 Construct data analysing method, the apparatus and system of debtor's knowledge mapping
CN110858353A (en) * 2018-08-17 2020-03-03 阿里巴巴集团控股有限公司 Method and system for obtaining case referee result
CN110858353B (en) * 2018-08-17 2023-05-05 阿里巴巴集团控股有限公司 Method and system for obtaining case judge result
CN109189943A (en) * 2018-09-19 2019-01-11 中国电子科技集团公司信息科学研究院 A kind of capability knowledge extracts and the method for capability knowledge map construction
CN109446337A (en) * 2018-09-19 2019-03-08 中国信息通信研究院 A kind of knowledge mapping construction method and device
CN109446337B (en) * 2018-09-19 2020-10-13 中国信息通信研究院 Knowledge graph construction method and device
CN109446523A (en) * 2018-10-23 2019-03-08 重庆誉存大数据科技有限公司 Entity attribute extraction model based on BiLSTM and condition random field
CN109446523B (en) * 2018-10-23 2023-04-25 重庆誉存大数据科技有限公司 Entity attribute extraction model based on BiLSTM and conditional random field
CN109471929B (en) * 2018-11-06 2021-08-17 湖南云智迅联科技发展有限公司 Method for semantic search of equipment maintenance records based on map matching
CN109508385A (en) * 2018-11-06 2019-03-22 云南大学 A kind of character relation analysis method in web page news data based on Bayesian network
CN109471929A (en) * 2018-11-06 2019-03-15 湖南云智迅联科技发展有限公司 A method of it is matched based on map and carries out equipment maintenance record semantic search
CN109657918B (en) * 2018-11-19 2023-07-18 平安科技(深圳)有限公司 Risk early warning method and device for associated evaluation object and computer equipment
CN109657918A (en) * 2018-11-19 2019-04-19 平安科技(深圳)有限公司 Method for prewarning risk, device and the computer equipment of association assessment object
CN109767758A (en) * 2019-01-11 2019-05-17 中山大学 Vehicle-mounted voice analysis method, system, storage medium and equipment
CN109767758B (en) * 2019-01-11 2021-06-08 中山大学 Vehicle-mounted voice analysis method, system, storage medium and device
CN111523315A (en) * 2019-01-16 2020-08-11 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111523315B (en) * 2019-01-16 2023-04-18 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN110210840A (en) * 2019-06-14 2019-09-06 言图科技有限公司 A kind of method and system for realizing business administration based on instant chat
CN110297904A (en) * 2019-06-17 2019-10-01 北京百度网讯科技有限公司 Generation method, device, electronic equipment and the storage medium of event name
CN110297904B (en) * 2019-06-17 2022-10-04 北京百度网讯科技有限公司 Event name generation method and device, electronic equipment and storage medium
CN110245244A (en) * 2019-06-20 2019-09-17 贵州电网有限责任公司 A kind of organizational affiliation knowledge mapping construction method based on mass text data
CN110399487B (en) * 2019-07-01 2021-09-28 广州多益网络股份有限公司 Text classification method and device, electronic equipment and storage medium
CN110399487A (en) * 2019-07-01 2019-11-01 广州多益网络股份有限公司 A kind of file classification method, device, electronic equipment and storage medium
CN110516077A (en) * 2019-08-20 2019-11-29 北京中亦安图科技股份有限公司 Knowledge mapping construction method and device towards enterprise's market conditions
CN111475641A (en) * 2019-08-26 2020-07-31 北京国双科技有限公司 Data extraction method and device, storage medium and equipment
CN110516120A (en) * 2019-08-27 2019-11-29 北京明略软件***有限公司 Information processing method and device, storage medium, electronic device
CN111105041A (en) * 2019-12-02 2020-05-05 成都四方伟业软件股份有限公司 Machine learning method and device for intelligent data collision
CN111105041B (en) * 2019-12-02 2022-12-23 成都四方伟业软件股份有限公司 Machine learning method and device for intelligent data collision
CN111382843B (en) * 2020-03-06 2023-10-20 浙江网商银行股份有限公司 Method and device for establishing enterprise upstream and downstream relationship identification model and mining relationship
CN111382843A (en) * 2020-03-06 2020-07-07 浙江网商银行股份有限公司 Method and device for establishing upstream and downstream relation recognition model of enterprise and relation mining
CN111400504A (en) * 2020-03-12 2020-07-10 支付宝(杭州)信息技术有限公司 Method and device for identifying enterprise key people
CN111400504B (en) * 2020-03-12 2023-04-07 支付宝(杭州)信息技术有限公司 Method and device for identifying enterprise key people
CN111967761B (en) * 2020-08-14 2024-04-02 国网数字科技控股有限公司 Knowledge graph-based monitoring and early warning method and device and electronic equipment
CN111967761A (en) * 2020-08-14 2020-11-20 国网电子商务有限公司 Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN112101034A (en) * 2020-09-09 2020-12-18 沈阳东软智能医疗科技研究院有限公司 Method and device for distinguishing attribute of medical entity and related product
CN112101034B (en) * 2020-09-09 2024-02-27 沈阳东软智能医疗科技研究院有限公司 Method and device for judging attribute of medical entity and related product
WO2022051996A1 (en) * 2020-09-10 2022-03-17 西门子(中国)有限公司 Method and apparatus for constructing knowledge graph
CN112000718A (en) * 2020-10-28 2020-11-27 成都数联铭品科技有限公司 Attribute layout-based knowledge graph display method, system, medium and equipment
CN112417104B (en) * 2020-12-04 2022-11-11 山西大学 Machine reading understanding multi-hop inference model and method with enhanced syntactic relation
CN112417104A (en) * 2020-12-04 2021-02-26 山西大学 Machine reading understanding multi-hop inference model and method with enhanced syntactic relation
CN112199961A (en) * 2020-12-07 2021-01-08 浙江万维空间信息技术有限公司 Knowledge graph acquisition method based on deep learning
CN112383575B (en) * 2021-01-18 2021-05-04 北京晶未科技有限公司 Method, electronic device and electronic equipment for information security
CN112383575A (en) * 2021-01-18 2021-02-19 北京晶未科技有限公司 Method, electronic device and electronic equipment for information security
CN113326371B (en) * 2021-04-30 2023-12-29 南京大学 Event extraction method integrating pre-training language model and anti-noise interference remote supervision information
CN113326371A (en) * 2021-04-30 2021-08-31 南京大学 Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
CN113468342A (en) * 2021-07-22 2021-10-01 北京京东振世信息技术有限公司 Data model construction method, device, equipment and medium based on knowledge graph
CN113468342B (en) * 2021-07-22 2023-12-05 北京京东振世信息技术有限公司 Knowledge graph-based data model construction method, device, equipment and medium
CN114741569A (en) * 2022-06-09 2022-07-12 杭州欧若数网科技有限公司 Method and device for supporting composite data types in graph database

Also Published As

Publication number Publication date
CN108182295B (en) 2021-09-10

Similar Documents

Publication Title
CN108182295A Enterprise knowledge graph attribute extraction method and system
CN106776581B Subjective text emotion analysis method based on deep learning
CN109857990B Financial bulletin information extraction method based on document structure and deep learning
CN110059188B Chinese emotion analysis method based on bidirectional temporal convolutional network
CN107239444B Word vector training method and system fusing part of speech and position information
CN109753660B LSTM-based winning bid web page named entity extraction method
CN110298037A Text matching recognition method based on convolutional neural networks with enhanced attention mechanism
CN108920544A Personalized job recommendation method based on knowledge graph
CN110287323B Target-oriented emotion classification method
CN111914558A Course knowledge relation extraction method and system based on sentence-bag attention and distant supervision
CN110472042B Fine-grained emotion classification method
CN109697285A Hierarchical BiLSTM disease code annotation method for Chinese electronic health records with enhanced semantic representation
CN110083700A Enterprise public opinion sentiment classification method and system based on convolutional neural networks
CN109558492A Listed-company knowledge graph construction method and device suitable for event attribution
CN110442720A Multi-label text classification method based on LSTM convolutional neural networks
CN110321563A Text sentiment analysis method based on a hybrid supervision model
CN111783394A Training method of event extraction model, event extraction method, system and equipment
CN105740382A Aspect classification method for short comment texts
CN113360582B Relation classification method and system based on BERT model fusing multi-entity information
CN115599899B Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN113128233B Construction method and system of mental disease knowledge graph
CN113255321A Financial field chapter-level event extraction method based on article entity word dependency relationship
CN111460830B Method and system for extracting economic events in judicial texts
CN111914556A Emotion guidance method and system based on emotion semantic transfer map
CN111710428A Biomedical text representation method for modeling global and local context interaction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20191111

Address after: No. 51, Daping Main Street, Yuzhong District, Chongqing 400042

Applicant after: CHONGQING TELECOMMUNICATION SYSTEM INTEGRATION CO.,LTD.

Applicant after: CHONGQING SOCIALCREDITS BIG DATA TECHNOLOGY CO.,LTD.

Address before: 18th Floor, Block C, Qilin Building, No. 2, No. 53 Huangshan Avenue, Yubei District, Chongqing 401121

Applicant before: CHONGQING SOCIALCREDITS BIG DATA TECHNOLOGY CO.,LTD.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Sun Shitong

Inventor after: Liu Debin

Inventor after: Yan Kai

Inventor after: Chen Wei

Inventor after: Yang Chen

Inventor before: Sun Shitong

Inventor before: Liu Debin

Inventor before: Yan Kai

Inventor before: Chen Wei

CB03 Change of inventor or designer information
CP03 Change of name, title or address

Address after: No.51, Daping Main Street, Yuzhong District, Chongqing 400042

Patentee after: Zhongdian Zhi'an Technology Co.,Ltd.

Country or region after: China

Patentee after: Chongqing Yucun Technology Co.,Ltd.

Address before: No.51, Daping Main Street, Yuzhong District, Chongqing 400042

Patentee before: CHONGQING TELECOMMUNICATION SYSTEM INTEGRATION CO.,LTD.

Country or region before: China

Patentee before: CHONGQING SOCIALCREDITS BIG DATA TECHNOLOGY CO.,LTD.

CP03 Change of name, title or address
TR01 Transfer of patent right

Effective date of registration: 20240409

Address after: Tower B, No. 10 Datagu West Road, Xiantao Street, Yubei District, Chongqing 401120

Patentee after: China Telecom Yijin Technology Co.,Ltd.

Country or region after: China

Patentee after: Chongqing Yucun Technology Co.,Ltd.

Address before: No.51, Daping Main Street, Yuzhong District, Chongqing 400042

Patentee before: Zhongdian Zhi'an Technology Co.,Ltd.

Country or region before: China

Patentee before: Chongqing Yucun Technology Co.,Ltd.

TR01 Transfer of patent right