CN112749283A

CN112749283A - Entity relationship joint extraction method for legal field

Info

Publication number: CN112749283A
Application number: CN202011625471.3A
Authority: CN
Inventors: 李参宏
Original assignee: Jiangsu Netmarch Technologies Co ltd
Current assignee: Jiangsu Netmarch Technologies Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-05-04

Abstract

The invention discloses a legal-field-oriented entity relationship joint extraction method, which comprises the following steps: acquiring text in the legal field and labeling it with entity relationship triples to construct a legal-field corpus; vectorizing the text content to obtain a vector representation of the text; building a joint extraction model for the entity relationships between persons and case-involved items in the legal field; training the joint extraction model; and using the trained model to extract the relationships between persons and case-involved items from the unstructured legal documents to be processed, obtaining the corresponding relationship triples. The beneficial effect of the invention is that the relationships between entities such as persons and case-involved items in the legal field can be mined more accurately.

Description

Entity relationship joint extraction method for legal field

Technical Field

The invention relates to the field of entity relationship joint extraction, in particular to an entity relationship joint extraction method for the legal field.

Background

The legal field contains a large number of entities such as persons and case-involved items, and the relationships among these entities are complicated, so the knowledge in legal documents is difficult to integrate efficiently. Mining the relationship triples of these entities forms a structured knowledge representation of legal documents. This helps legal practitioners grasp and analyze cases more accurately, and supports intelligent legal services such as case pushing and criminal prediction.

At present, most relation extraction work adopts a pipeline approach: entities are extracted first, and relationships are then predicted for pairwise combinations of those entities. The pipeline approach is easy to implement but suffers from error propagation, because the performance of relation extraction depends to some extent on the performance of entity extraction. It also generates a large amount of redundant data, so it is difficult to meet the needs of the legal field, in particular the need to grasp cases accurately.

Therefore, it is necessary to provide a new extraction method to solve the above problems.

Disclosure of Invention

To solve the problems set forth in the background above, the invention provides a legal-field-oriented entity relationship joint extraction method, which can more accurately mine the relationships among entities such as persons and case-involved items in the legal field and meets the need to grasp case details accurately.

In order to achieve the purpose, the invention provides the following technical scheme: a legal domain-oriented entity relationship joint extraction method comprises the following steps:

Step S1: acquiring text in the legal field and labeling it with entity relationship triples to construct a legal-field corpus;

Step S2: vectorizing the text content from step S1 to obtain a vector representation of the text;

Step S3: building a joint extraction model for the entity relationships between persons and case-involved items in the legal field, specifically:

Step S31: for each word vector $x_t$ obtained in step S2, using the Bi-LSTM of the coding layer to obtain, through forward and backward propagation respectively, the forward and backward feature information of the legal document;

Step S32: concatenating the forward and backward feature information to obtain the feature vector of the coding layer at time t;

Step S33: inputting the coding-layer feature vector from step S32 into a CRF layer to identify person and case-involved item entities in the legal document; the entity recognition result and the coding-layer feature vector are then used as the input of the decoding-layer Bi-LSTM at time t, which in the same way computes forward and backward propagation to obtain the forward and backward semantic information of the legal document;

Step S34: concatenating the forward and backward semantic information to obtain the final semantic information, that is, the semantic vector that the decoding-layer Bi-LSTM derives at time t from the context of the document;

Step S35: taking the semantic vector obtained in step S34 as the input of a softmax classifier, and classifying to obtain the relationship label of an entity pair;

Step S4: training the joint extraction model that has been built;

Step S5: using the trained joint extraction model to extract the relationships between persons and case-involved items from the unstructured legal documents to be processed, obtaining the corresponding relationship triples.

The step S1 specifically includes:

Step S11: downloading legal documents, by legal case type, from websites such as China Judgments Online (中国裁判文书网);

Step S12: labeling the entities such as persons and case-involved items in the documents obtained in step S11, and labeling the relationships between the persons and the case-involved items to form relationship triples;

Step S13: repeating steps S11 to S12 until all sentences are labeled, which completes the legal-field corpus.

The step S2 specifically includes:

Step S21: for each sentence in the legal document, taking the character as the basic unit and encoding each character as a one-hot vector to obtain the one-hot representation of the sentence;

Step S22: taking the one-hot vectors of the sentence as the input of the word2vec model, training the word2vec model, and continuously updating the weight matrix w with a gradient descent algorithm;

Step S23: multiplying the weight matrix obtained by training in step S22 by the one-hot vector of each character to obtain the embedding of that character, and finally obtaining the embedding representation of the whole sentence.

The step S4 specifically includes:

Step S41: randomly dividing the legal-field corpus so that the ratio of the training set to the test set is 7:3: train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3);

Step S42: selecting the negative log-likelihood function as the loss function. Because the model jointly extracts the entities and relationships of persons, case-involved items and the like in the legal field, the loss function is composed of an entity cost function and a relation cost function. In the cost function, $|S|$ represents the length of the sentence, $e_i$ and $r_i$ represent the entity label and the relation label that the model assigns to the i-th character, and $\theta$ represents the parameter set of the model;

Step S43: continuously updating the shared parameters $\theta$ with a stochastic gradient descent algorithm;

Step S44: training the model and saving the trained model.

The step S5 specifically includes:

Step S51: testing the model by taking the test set obtained in step S41 as the input of the model;

Step S52: evaluating the relationship triples obtained in step S51, using precision (P), recall (R) and the F1 value as evaluation indexes, computed as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}$$

where TP is the number of positive instances correctly predicted as positive, FP is the number of negative instances predicted as positive, and FN is the number of positive instances predicted as negative.

Compared with the prior art, the legal-field-oriented entity relationship joint extraction method has the following beneficial effect: the relationships between entities such as persons and case-involved items in the legal field can be mined more accurately.

Drawings

FIG. 1 is a flow chart of the legal-field-oriented entity relationship joint extraction method according to the present invention;

FIG. 2 is a schematic flow chart of building the joint extraction model for the entity relationships between persons and case-involved items in the legal field.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIGS. 1 and 2, the present invention provides a legal-field-oriented entity relationship joint extraction method, which includes the following steps:

step S1: acquiring a text in the legal field, and carrying out entity relationship triple labeling on the text to construct a corpus in the legal field;

Specifically, step S1 includes:

Step S11: acquiring legal-field texts, by case type, from websites such as China Judgments Online (中国裁判文书网), and downloading the legal documents;

Step S12: labeling the entities such as persons and case-involved items in the documents obtained in step S11, and labeling the relationships between the persons and the case-involved items to form relationship triples;

Step S13: repeating steps S11 to S12 until all sentences are labeled, which completes the legal-field corpus.

Specifically, for the entity mention recognition task, the constructed legal-field corpus must be annotated: the person and case-involved item entities of each relationship triple are labeled in the legal document with BIO tags, as follows. For each corpus item, the corresponding entity mentions are first located in the legal document by regular-expression matching. Each matched entity mention is then labeled using the BIO scheme, where B marks the first character of the entity mention, I marks a middle or final character of the entity mention, and O marks any other character. Finally, all remaining characters in the legal document are labeled O.
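The BIO labeling just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `bio_tag`, the input format (a dictionary from entity mention to type), and the type suffixes `PER` and `ITEM` are all assumptions introduced here, as is the example sentence.

```python
import re

def bio_tag(sentence, entities):
    """Return one BIO tag per character of `sentence`.

    `entities` maps an entity mention (surface string) to a type name.
    The type names and this input format are illustrative assumptions.
    """
    tags = ["O"] * len(sentence)  # every character starts as O
    # Match longer mentions first so a short mention cannot split a long one.
    for mention in sorted(entities, key=len, reverse=True):
        etype = entities[mention]
        for m in re.finditer(re.escape(mention), sentence):
            start, end = m.span()
            if any(t != "O" for t in tags[start:end]):
                continue  # skip spans overlapping an already tagged entity
            tags[start] = "B-" + etype          # first character of the mention
            for i in range(start + 1, end):
                tags[i] = "I-" + etype          # middle or final characters
    return tags

# Hypothetical example sentence and entity list.
sentence = "张三盗窃手机一部"
tags = bio_tag(sentence, {"张三": "PER", "手机": "ITEM"})
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Characters that no mention covers keep the initial O tag, which corresponds to the final step of the scheme above.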

Step S2: vectorizing the text content in step S1 to obtain a vector representation of the text;

Specifically, step S2 includes:

step S21: for each sentence in the legal document, taking a character as a basic unit, and performing one-hot representation on each character by using a one-hot coding mode to obtain one-hot representation of the sentence;

step S22: taking the one-hot vector of the sentence as the input of the word2vec model, training the word2vec model, and continuously updating the weight matrix w by using a gradient descent algorithm;

step S23: multiplying the weight matrix obtained by training in step S22 by the one-hot vector of each character to obtain the embedding of that character, and finally obtaining the embedding representation of the whole sentence.
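Steps S21 and S23 can be illustrated with a small sketch. It assumes a hypothetical four-character vocabulary and a made-up 4 x 3 weight matrix `W` standing in for the matrix that word2vec training would produce in step S22; the training itself is not shown. The point of the example is that multiplying a one-hot vector by the weight matrix simply selects the matrix row of that character.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vec_matmul(vec, matrix):
    """Multiply a row vector (length V) by a V x D matrix, giving length D."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

# Hypothetical vocabulary and weight matrix (illustrative values only).
vocab = {"张": 0, "三": 1, "手": 2, "机": 3}
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def embed_sentence(sentence):
    """Embed each character as one_hot(character) x W, as in step S23."""
    return [vec_matmul(one_hot(vocab[ch], len(vocab)), W) for ch in sentence]

embedded = embed_sentence("张三")
print(embedded)
```

In practice the multiplication is usually replaced by a direct row lookup, which gives the same result without building the one-hot vector.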

Step S3: building a joint extraction model for the entity relationships between persons and case-involved items in the legal field, specifically as follows.

FIG. 2 shows a schematic flow chart of building the joint extraction model for the entity relationships between persons and case-involved items in the legal field.

Step S31: for each word vector $x_t$ obtained in step S2, using the Bi-LSTM of the coding layer to obtain, through forward and backward propagation respectively, the forward and backward feature information of the legal document;

Step S32: concatenating the forward and backward feature information to obtain the feature vector of the coding layer at time t;

Step S33: inputting the coding-layer feature vector from step S32 into a CRF layer to identify person and case-involved item entities in the legal document; the entity recognition result and the coding-layer feature vector are then used as the input of the decoding-layer Bi-LSTM at time t, which in the same way computes forward and backward propagation to obtain the forward and backward semantic information of the legal document;

Step S34: concatenating the forward and backward semantic information to obtain the final semantic information, that is, the semantic vector that the decoding-layer Bi-LSTM derives at time t from the context of the document;

Step S35: taking the semantic vector obtained in step S34 as the input of a softmax classifier, and classifying to obtain the relationship label of an entity pair;
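Two of the components named above, the CRF layer of step S33 and the softmax classifier of step S35, can be sketched in isolation. The following is an illustrative sketch only: it assumes a linear-chain CRF decoded with the Viterbi algorithm (the source does not state the decoding procedure), and every score, tag name and relation name is made up. The Bi-LSTM layers that would produce these scores are not shown.

```python
import math

def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][k]   : score of tag k at position t (assumed encoder output)
    transitions[j][k] : score of moving from tag j to tag k
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best score of a path ending in each tag
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for k in range(n_tags):
            best_j = max(range(n_tags),
                         key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k] + emit[k])
            ptr.append(best_j)
        score = new_score
        backptr.append(ptr)
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for ptr in reversed(backptr):  # trace back from the best final tag
        path.append(ptr[path[-1]])
    return path[::-1]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tag set and scores; the O -> I transition is penalized.
tag_names = ["O", "B", "I"]
emissions = [[1.0, 0.9, 0.0], [0.2, 0.1, 1.5], [1.0, 0.0, 0.9]]
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
path = viterbi(emissions, transitions)
print([tag_names[k] for k in path])

# Hypothetical relation labels and classifier scores for one entity pair.
relation_names = ["none", "owns", "steals"]
probs = softmax([0.0, 1.0, 3.0])
print(relation_names[probs.index(max(probs))])
```

In this toy input, the first position scores slightly higher for O than for B, yet the decoded sequence starts with B, because the transition scores penalize an I tag that directly follows O. This is the kind of sequence-level constraint a CRF layer adds on top of per-position scores.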

step S4: training the established entity relation joint extraction model;

Specifically, step S4 includes:

Step S41: randomly dividing the legal-field corpus so that the ratio of the training set to the test set is 7:3: train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3);

Step S42: selecting the negative log-likelihood function as the loss function. Because the model jointly extracts the entities and relationships of persons, case-involved items and the like in the legal field, the loss function is composed of an entity cost function and a relation cost function. In the cost function, $|S|$ represents the length of the sentence, $e_i$ and $r_i$ represent the entity label and the relation label that the model assigns to the i-th character, and $\theta$ represents the parameter set of the model;

Step S43: continuously updating the shared parameters $\theta$ with a stochastic gradient descent algorithm;

Step S44: training the model and saving the trained model.
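The loss of step S42 and the update of step S43 can be sketched as below. The exact cost function is not reproduced in this text, so the form used here is an assumption: an unweighted sum of the negative log-probabilities of the gold entity labels and the gold relation labels over the $|S|$ positions of a sentence. The function names, the learning rate and all numbers are illustrative.

```python
import math

def joint_nll(entity_probs, relation_probs):
    """Joint negative log-likelihood for one sentence (assumed form).

    entity_probs[i]   : model probability of the gold entity label e_i
    relation_probs[i] : model probability of the gold relation label r_i
    Assumption: loss = -(sum_i log p(e_i) + sum_i log p(r_i)), unweighted.
    """
    entity_cost = -sum(math.log(p) for p in entity_probs)
    relation_cost = -sum(math.log(p) for p in relation_probs)
    return entity_cost + relation_cost

def sgd_step(theta, grads, lr=0.1):
    """One stochastic gradient descent update: theta <- theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grads)]

# Made-up probabilities for a three-character sentence.
loss = joint_nll([0.9, 0.8, 0.95], [0.7, 0.6, 0.9])
print(round(loss, 4))

# Made-up parameters and gradients, only to show the update rule.
theta = sgd_step([1.0, -2.0], [0.5, -1.0], lr=0.1)
print(theta)
```

Because both cost terms are computed from the same shared parameters, a single gradient step on the summed loss updates the entity and relation parts of the model together.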

Step S5: using the trained joint extraction model to extract the relationships between persons and case-involved items from the unstructured legal documents to be processed, obtaining the corresponding relationship triples.

Specifically, step S5 includes:

Step S51: testing the model by taking the test set obtained in step S41 as the input of the model;

Step S52: evaluating the relationship triples obtained in step S51, using precision (P), recall (R) and the F1 value as evaluation indexes, computed as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}$$

where TP is the number of positive instances correctly predicted as positive, FP is the number of negative instances predicted as positive, and FN is the number of positive instances predicted as negative.
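The evaluation of the extracted triples can be sketched as follows. The sketch assumes that a predicted triple counts as correct only when it matches a gold triple exactly; the source does not state the matching criterion. The function name and the example triples are hypothetical.

```python
def evaluate_triples(predicted, gold):
    """Compute precision, recall and F1 over (head, relation, tail) triples.

    Assumption: a prediction is a true positive only on exact triple match.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # triples found and correct
    fp = len(predicted - gold)   # triples found but not in the gold set
    fn = len(gold - predicted)   # gold triples the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted triples: two matches, two misses each way.
gold = [("张三", "steals", "手机"), ("李四", "owns", "手机"),
        ("张三", "uses", "撬棍"), ("王五", "owns", "汽车")]
predicted = [("张三", "steals", "手机"), ("李四", "owns", "手机"),
             ("张三", "owns", "撬棍"), ("王五", "steals", "汽车")]
p, r, f1 = evaluate_triples(predicted, gold)
print(p, r, f1)
```

The guards against zero denominators return 0.0 when the model predicts nothing or the gold set is empty, which avoids a division error on degenerate inputs.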

Claims

1. A legal-field-oriented entity relationship joint extraction method is characterized by comprising the following steps:

Step S32: concatenating the forward and backward feature information to obtain the feature vector of the coding layer at time t;

Step S34: concatenating the forward and backward semantic information to obtain the final semantic information, that is, the semantic vector that the decoding-layer Bi-LSTM derives at time t from the context of the document;

step S35: taking the semantic vector obtained in the step S34 as the input of a softmax classifier, and classifying to obtain a relationship label of an entity pair;

step S4: training the established entity relation joint extraction model;

2. The method for joint extraction of entity relationships in the legal domain according to claim 1, wherein: the step S1 specifically includes:

step S11: downloading legal documents according to different case types;

3. The method for joint extraction of entity relationships in the legal domain according to claim 1, wherein: the step S2 specifically includes:

4. The method for joint extraction of entity relationships in the legal domain according to claim 1, wherein: the step S4 specifically includes:

step S44: and training the model, and storing the trained model.

5. The method for joint extraction of entity relationships in the legal domain according to claim 4, wherein: the step S5 specifically includes: