CN111460834B

CN111460834B - Legal semantic annotation method and device based on LSTM network

Info

Publication number: CN111460834B
Application number: CN202010273691.8A
Authority: CN
Inventors: 莫同; 李雨萌; 骆旭辉; 刘亚亭; 张艺璇
Original assignee: Beijing Peking University Software Engineering Co ltd
Current assignee: Beijing Peking University Software Engineering Co ltd
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2023-06-06
Anticipated expiration: 2040-04-09
Also published as: CN111460834A

Abstract

The invention relates to a method and a device for legal semantic annotation based on an LSTM network. The method comprises: obtaining a text to be analyzed; analyzing the text to be analyzed to obtain all of its words and the part-of-speech tags corresponding to the words, converting the words into D-dimension word vectors, and inputting the D-dimension word vectors into a fully-connected neural network to obtain feature codes; comparing the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in a preset database to obtain a best-matching text, and obtaining a final vector representation; and inputting the final vector representation into a fully connected neural network and outputting a semantic role label for each word in the text to be analyzed. The invention can automatically identify elements such as the agent, the patient, the time and the place in legal provisions, can help the personnel concerned understand the semantics of legal provisions, provides support for higher-level legal informatization applications, and can effectively improve the working efficiency of staff.

Description

Legal semantic annotation method and device based on LSTM network

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a method and a device for legal semantic annotation based on an LSTM network.

Background

Existing shallow semantic analysis methods, such as semantic role labeling, mostly rely on some degree of syntactic analysis or on manually extracted features. Syntactic analysis has a certain error rate, and those errors carry over into the subsequent semantic analysis result. Semantic role labeling therefore remains a technically difficult task in natural language processing. With the rapid development of deep learning in recent years, semantic role labeling for English and Chinese has improved greatly, and good results have been achieved on data sets from multiple domains.

However, the growing number of cases and laws in the judicial field puts great pressure on personnel engaged in legal work. Even professional lawyers find it difficult to be familiar with all legal provisions, and extracting case-related content from massive legal texts takes a great deal of time and effort, so working efficiency is low. Assisting the relevant practitioners through artificial intelligence has therefore become a pressing problem.

Disclosure of Invention

In view of the above, the invention aims to overcome the defects of the prior art, and provides a legal semantic annotation method and device based on an LSTM network, so as to solve the problems that a great deal of time and effort are required to be consumed and the working efficiency is low in the prior art for acquiring case related contents from massive legal texts.

In order to achieve the above purpose, the invention adopts the following technical scheme: a legal semantic annotation method based on an LSTM network comprises the following steps:

acquiring a text and preprocessing the text to acquire a text to be analyzed;

analyzing and processing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;

comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain a best-matching text in the preset database, and vectorizing semantic role labels of the best-matching text and position information corresponding to the semantic role labels to obtain feature vectors;

compounding the feature codes with the feature vectors to obtain final vector representations;

and inputting the final vector representation into a fully-connected neural network, and outputting semantic role labels of each word in the text to be analyzed.

Further, the obtaining text and preprocessing the text to obtain the text to be analyzed includes:

normalizing the text to obtain a text to be analyzed in a standard data input form; the text to be analyzed in the standard data input form is the text of the specified center predicate.

Further, the center predicate includes:

administrative subject, administrative relative person, time, place.

Further, the analyzing the text to be analyzed to obtain all words of the text to be analyzed and the part-of-speech tags corresponding to the words includes:

splitting the text to be analyzed according to a legal dictionary by adopting a Chinese word segmentation tool and a part-of-speech tagging tool;

and acquiring all words of the analysis text and part-of-speech tags corresponding to the words.

Further, inputting all the D-dimensional word vectors into a fully connected neural network to obtain feature codes of all the words, including:

all D-dimension word vectors are sequentially input into a fully-connected neural network, the fully-connected neural network is provided with a feature encoder, the feature encoder comprises a 4-layer stacked bidirectional LSTM, and the method comprises the following steps: first layer LSTM, second layer LSTM, third layer LSTM, fourth layer LSTM;

the first layer LSTM takes the D-dimension word vector as input for encoding, then the input of each layer LSTM is the output of the upper layer, and the fourth layer LSTM outputs feature encoding.

Further, the comparing the part of speech tag of the text to be analyzed with the part of speech tag of the text in a preset database to obtain the best matching text in the preset database includes:

taking the central predicate as the center, matching character strings outward on both sides between the part-of-speech tags of the text to be analyzed and the part-of-speech tags of the texts in the preset database;

and calculating the matching degree according to the matching length of the character strings, and obtaining the best matching text.

Further, vectorizing the semantic role labels of the best-matching text and the position information corresponding to the semantic role labels to obtain feature vectors includes:

vectorizing the semantic role labels of the best-matching text to obtain a first vector representation;

vectorizing the distance between the semantic role labels and the central predicates to obtain a second vector representation;

the first vector representation and the second vector representation are composited into a feature vector.

Further, inputting the final vector representation into a fully connected neural network and outputting semantic role labels of each word in the text to be analyzed includes:

inputting the final vector into a fully connected neural network, wherein a softmax layer is arranged in the fully connected neural network, the softmax layer adopts a softmax classifier to carry out semantic role labeling on each word, and the softmax layer outputs the semantic role labeling.

Further, the word vector model includes:

word2vec language model, glove language model, or BERT language model.

The embodiment of the application provides a legal semantic annotation device based on an LSTM network, which comprises:

the preprocessing module is used for acquiring a text and preprocessing the text to acquire a text to be analyzed;

the first processing module is used for analyzing and processing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;

the second processing module is used for comparing the part-of-speech labels of the texts to be analyzed with the part-of-speech labels of the texts in a preset database to obtain the most matched texts in the preset database, and vectorizing semantic role labels of the most matched texts and position information corresponding to the semantic role labels to obtain feature vectors;

the acquisition module is used for compositing the feature codes with the feature vectors to acquire final vector representations;

and the output module is used for inputting the final vector representation into a fully-connected neural network and outputting semantic role labels of each word in the text to be analyzed.

By adopting the technical scheme, the invention has the following beneficial effects:

According to the method, first, a legal text is vectorized and a part-of-speech tagging result is predicted; second, the most similar legal provision is found in a database based on the part-of-speech tagging result, and a vector representation of its semantic role labels is obtained; finally, the data are input into an LSTM network to obtain the semantic role label of each word. Deep learning is applied in the vectorization process, which gives the method a degree of extensibility.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of steps of a legal semantic annotation method based on an LSTM network;

FIG. 2 is a schematic structural diagram of a legal semantic annotation device based on an LSTM network.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of protection of the invention.

A specific method for legal semantic annotation based on an LSTM network provided in the embodiments of the present application is described below with reference to the accompanying drawings.

As shown in FIG. 1, the legal semantic annotation method based on an LSTM network provided in the embodiment of the present application includes:

S101, acquiring a text and preprocessing the text to acquire a text to be analyzed;

The method is mainly intended for staff consulting legal provisions. First, a legal text is obtained and preprocessed. Preprocessing normalizes the text into a standard data input form: for each input text, the center predicate in the text is specified, and the text with its center predicate specified is the text to be analyzed.

S102, analyzing and processing a text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;

The obtained text to be analyzed is segmented into a plurality of words, and a part-of-speech tag is produced for each word. Each word is then vectorized with a word vector model, that is, converted into a D-dimension word vector, and each D-dimension word vector is input into a fully-connected neural network for training to obtain the feature code of each word. A D-dimension word vector is a vector of dimension D that represents a Chinese word.

The present application uses an existing word vector model and places no special requirements on it; the parameters of the word vector model can be adjusted as needed so that word recognition is more accurate and faster.

S103, comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain a best-matching text in the preset database, and vectorizing semantic role labels of the best-matching text and position information corresponding to the semantic role labels to obtain feature vectors;

A database is preset, legal texts are stored in the database, and the legal texts in the database are part-of-speech tagged. The part-of-speech tags of the given text to be analyzed are compared with the part-of-speech tags of the legal texts in the database, and the legal text in the database with the highest matching degree is taken as the best-matching text. The semantic role labels of the best-matching text and the position information corresponding to those labels are then vectorized to obtain a feature vector.

S104, compositing the feature codes and the feature vectors to obtain a final vector representation;

The feature codes of each word obtained in step S102 are concatenated with the feature vectors of the corresponding words of the best-matching text obtained in step S103 to obtain a final vector representation.

S105, inputting the final vector representation into a fully-connected neural network, and outputting semantic role labels of each word in the text to be analyzed.

The final vector representation is input into a fully-connected neural network, classified by a softmax classifier in the fully-connected neural network, and the semantic role label of each word in the text to be analyzed is finally output.

The working principle of the legal semantic annotation method based on the LSTM network is as follows: first, the legal text is vectorized and the part-of-speech tagging result is predicted; second, the most similar legal text in the database is found based on the part-of-speech tagging result, and a vector representation of its semantic role labels is obtained; finally, the data are input into an LSTM network to obtain the semantic role label of each word. Deep learning is applied in the vectorization process, which gives the method a degree of extensibility. The invention can automatically identify elements such as the agent, the patient, the time and the place in legal provisions, can help the personnel concerned understand the semantics of legal provisions, provides support for higher-level legal informatization applications, and can effectively improve the working efficiency of staff.

In some embodiments, obtaining text and preprocessing the text to obtain text to be analyzed includes:

normalizing the text to obtain a text to be analyzed in a standard data input form; the text to be analyzed in the form of a standard data input is the text specifying the center predicate.

Preferably, the center predicate includes:

administrative subject, administrative relative person, time, place.

In some embodiments, analyzing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech tags corresponding to the words, including:

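splitting the text to be analyzed according to a legal dictionary by adopting a Chinese word segmentation tool and a part-of-speech tagging tool;

and acquiring all words of the text to be analyzed and the part-of-speech tags corresponding to the words.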

Specifically, a Chinese word segmentation tool is used to segment the legal text and obtain all words in the text, and a part-of-speech tagging tool is used to tag all words obtained from the legal text, yielding the part-of-speech tag of each word. The text to be analyzed is also split according to the legal dictionary, so that the related terms in the legal dictionary are obtained, for example: administrative subject, administrative relative person, time, place, etc.

It should be noted that the Chinese word segmentation tool and the part-of-speech tagging tool used in the present application both belong to the prior art and are not described in detail here.
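As an illustration of dictionary-driven segmentation and tagging, the following is a minimal sketch only. The patent does not name its segmentation or tagging tools, so forward maximum matching is swapped in here as a simple stand-in technique; the dictionary entries, the tag names and the function `segment_and_tag` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy legal dictionary: word -> part-of-speech tag.
LEGAL_DICT: Dict[str, str] = {
    "行政机关": "n",
    "应当": "v",
    "及时": "ad",
    "公开": "v",
    "政府信息": "n",
}


def segment_and_tag(text: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in lexicon)
    result: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                result.append((piece, lexicon[piece]))
                i += size
                break
        else:
            # No dictionary word starts here: emit one character with a placeholder tag.
            result.append((text[i], "x"))
            i += 1
    return result


pairs = segment_and_tag("行政机关应当及时公开政府信息", LEGAL_DICT)
words = [word for word, _ in pairs]
tags = [tag for _, tag in pairs]
```

In practice an existing segmentation and tagging tool loaded with a legal user dictionary would replace this toy matcher; the sketch only shows the shape of the output, a list of (word, tag) pairs.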

In some embodiments, inputting all D-dimensional word vectors into the fully connected neural network obtains feature encodings of all words, including:

all D-dimension word vectors are sequentially input into a fully connected neural network, the fully connected neural network is provided with a feature encoder, the feature encoder comprises a 4-layer stacked bidirectional LSTM, and the method comprises the following steps: first layer LSTM, second layer LSTM, third layer LSTM, fourth layer LSTM;

the first layer LSTM is encoded with the D-dimension word vector as input, then the input of each layer LSTM is the output of the upper layer, and the fourth layer LSTM outputs the feature code.

Specifically, all D-dimension word vectors are input in sequence into a feature encoder built from bidirectional LSTMs. The feature encoder consists of 4 stacked bidirectional LSTM layers: a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM. The first layer LSTM encodes the D-dimension word vectors as input, the second layer LSTM takes the output of the first layer LSTM as input, and in the same way each subsequent layer takes the output of the layer before it as input; finally, the fourth layer LSTM outputs the feature code W_i. To mitigate the vanishing-gradient problem of the multilayer LSTM structure, a highway LSTM structure is introduced in the present application.
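The data flow through the four stacked bidirectional layers can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions: the weights are random and untrained, biases are left out, and the highway connections mentioned above are omitted because the text does not give their form. The names `LSTMCell`, `BiLSTMLayer`, `build_encoder` and `encode`, and the sizes used in the example, are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(weights: List[Vector], x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class LSTMCell:
    """One LSTM cell with random (untrained) weights and no biases."""

    def __init__(self, in_dim: int, hidden: int, rng: random.Random) -> None:
        self.hidden = hidden
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.w = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hidden)]
                   for _ in range(hidden)]
            for gate in ("i", "f", "o", "g")
        }

    def step(self, x: Vector, h: Vector, c: Vector) -> Tuple[Vector, Vector]:
        z = x + h  # list concatenation [x; h]
        i = [_sigmoid(v) for v in _matvec(self.w["i"], z)]
        f = [_sigmoid(v) for v in _matvec(self.w["f"], z)]
        o = [_sigmoid(v) for v in _matvec(self.w["o"], z)]
        g = [math.tanh(v) for v in _matvec(self.w["g"], z)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run_direction(cell: LSTMCell, xs: List[Vector], reverse: bool) -> List[Vector]:
    """Run one cell over the sequence; outputs are returned in original word order."""
    h = [0.0] * cell.hidden
    c = [0.0] * cell.hidden
    order = list(reversed(xs)) if reverse else list(xs)
    outputs: List[Vector] = []
    for x in order:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs[::-1] if reverse else outputs


class BiLSTMLayer:
    def __init__(self, in_dim: int, hidden: int, rng: random.Random) -> None:
        self.forward_cell = LSTMCell(in_dim, hidden, rng)
        self.backward_cell = LSTMCell(in_dim, hidden, rng)

    def __call__(self, xs: List[Vector]) -> List[Vector]:
        fwd = run_direction(self.forward_cell, xs, reverse=False)
        bwd = run_direction(self.backward_cell, xs, reverse=True)
        return [a + b for a, b in zip(fwd, bwd)]  # concatenate both directions


def build_encoder(d: int, hidden: int, num_layers: int = 4, seed: int = 0) -> List[BiLSTMLayer]:
    rng = random.Random(seed)
    dims = [d] + [2 * hidden] * (num_layers - 1)
    return [BiLSTMLayer(in_dim, hidden, rng) for in_dim in dims]


def encode(xs: List[Vector], layers: List[BiLSTMLayer]) -> List[Vector]:
    for layer in layers:  # each layer consumes the output of the previous one
        xs = layer(xs)
    return xs


rng = random.Random(1)
word_vectors = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]  # 5 words, D = 8
encoder = build_encoder(d=8, hidden=4)
features = encode(word_vectors, encoder)  # one feature code per word
```

Each word receives one feature code whose length is twice the hidden size, because the forward and backward outputs are concatenated at every layer.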

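In some embodiments, comparing the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in the preset database to obtain the best-matching text in the preset database includes:

taking the central predicate as the center, matching character strings outward on both sides between the part-of-speech tags of the text to be analyzed and the part-of-speech tags of the texts in the preset database;

and calculating the matching degree according to the matching length of the character strings, and obtaining the best-matching text.

Preferably, vectorizing the semantic role labels of the best-matching text and the position information corresponding to the semantic role labels to obtain feature vectors includes:

vectorizing the semantic role labels of the best-matching text to obtain a first vector representation;

vectorizing the distance between the semantic role labels and the central predicate to obtain a second vector representation;

the first vector representation and the second vector representation are combined into a feature vector.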

Specifically, according to the part-of-speech tags of a given legal text, the annotated data set is searched for the legal text whose part-of-speech tags are most similar, and that text serves as a template. The similarity between the text to be analyzed S and each legal text S_i in the database D is calculated. In particular, a longest-string-matching method is used: taking the central predicate V as the center, strings are matched outward on both sides to obtain the longest matching length L_i, and the text with the largest L_i is taken as the best-matching text S_sim:

$$S_{\mathrm{sim}} = \arg\max_{S_i \in D} L_i$$
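One possible reading of this matching step is sketched below. The text does not say exactly how the match length is counted, so the sketch assumes that the two tag sequences are aligned at their central predicates and that L_i is the number of consecutive identical part-of-speech tags found when extending left and right from that position, counting the predicate itself. The function names and the toy tag sequences are hypothetical.

```python
from typing import List, Sequence, Tuple


def match_length(tags_a: Sequence[str], pred_a: int,
                 tags_b: Sequence[str], pred_b: int) -> int:
    """Count consecutive equal tags outward from the two predicate positions."""
    if tags_a[pred_a] != tags_b[pred_b]:
        return 0
    length = 1  # the predicate position itself
    # Extend to the left of the predicate.
    i, j = pred_a - 1, pred_b - 1
    while i >= 0 and j >= 0 and tags_a[i] == tags_b[j]:
        length += 1
        i -= 1
        j -= 1
    # Extend to the right of the predicate.
    i, j = pred_a + 1, pred_b + 1
    while i < len(tags_a) and j < len(tags_b) and tags_a[i] == tags_b[j]:
        length += 1
        i += 1
        j += 1
    return length


def best_match(query_tags: Sequence[str], query_pred: int,
               database: List[Tuple[Sequence[str], int]]) -> Tuple[int, int]:
    """Return (index of the best-matching text, its match length L_i)."""
    lengths = [match_length(query_tags, query_pred, tags, pred)
               for tags, pred in database]
    best = max(range(len(lengths)), key=lengths.__getitem__)
    return best, lengths[best]


query = ["n", "d", "v", "n"]  # predicate at index 2
database = [
    (["n", "v", "n", "p"], 1),
    (["r", "n", "d", "v", "n"], 3),
    (["n", "d", "n"], 2),
]
best_index, best_length = best_match(query, 2, database)
```

With these toy sequences the second database entry matches on all four query positions, so it is selected as the template. Ties are broken in favor of the earlier entry, which is a choice of this sketch rather than something the text specifies.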

The semantic role labeling result of the best-matching text S_sim is then vectorized to obtain the vector representation of the best-matching text. Specifically, the semantic role labels in the best-matching text are vectorized to obtain a dim1-dimensional vector representation R_sim. At the same time, the relative distance between each semantic role and the central predicate is encoded to obtain a dim2-dimensional vector representation PE_sim. The vectors R_sim and PE_sim are concatenated with the feature codes of the text to be analyzed to obtain the final vector representation.
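The two template vectors can be illustrated with a simple one-hot encoding. This is a sketch under assumptions: the text does not say how the labels or distances are vectorized (a trained model might use learned embeddings instead), nor how template words are aligned with words of the text to be analyzed. The label set `ROLE_LABELS`, the clipping bound `MAX_DIST` and the function names are hypothetical.

```python
from typing import List

ROLE_LABELS = ["O", "A0", "A1", "TMP", "LOC", "V"]  # hypothetical label set; dim1 = 6
MAX_DIST = 3  # relative distances are clipped to [-3, 3]; dim2 = 7


def one_hot(index: int, size: int) -> List[float]:
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def role_vector(label: str) -> List[float]:
    """dim1-dimensional vector for one semantic role label (role vector)."""
    return one_hot(ROLE_LABELS.index(label), len(ROLE_LABELS))


def position_vector(token_index: int, predicate_index: int) -> List[float]:
    """dim2-dimensional vector for the clipped distance to the central predicate."""
    dist = max(-MAX_DIST, min(MAX_DIST, token_index - predicate_index))
    return one_hot(dist + MAX_DIST, 2 * MAX_DIST + 1)


def template_features(labels: List[str], predicate_index: int) -> List[List[float]]:
    """Concatenate the role vector and the position vector for every template word."""
    return [role_vector(label) + position_vector(i, predicate_index)
            for i, label in enumerate(labels)]


template_labels = ["A0", "O", "V", "A1"]  # roles of the best-matching text, predicate at index 2
feats = template_features(template_labels, predicate_index=2)
```

Each resulting vector has dim1 + dim2 = 13 entries, with one active entry for the role and one for the relative position.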

In some embodiments, inputting the final vector representation into the fully connected neural network, outputting semantic role labels for each word in the text to be analyzed, comprising:

the final vector is input into a fully connected neural network, a softmax layer is arranged in the fully connected neural network, the softmax layer adopts a softmax classifier to carry out semantic role labeling on each word, and the softmax layer outputs the semantic role labeling.

Specifically, the output W_i of the last bidirectional LSTM layer is concatenated with the vectors R_sim and PE_sim obtained in step S103 to obtain the final vector representation [W_i; R_sim; PE_sim], which is input into a fully connected neural network and then passed through a softmax layer to obtain a multi-class classification result. The output of the softmax layer is the semantic role label of each word in the text to be analyzed relative to the given predicate.
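The final classification step for a single word can be sketched as a concatenation, one linear layer and a softmax. The weights, the three-label set and the vector sizes below are made-up values chosen only so that the arithmetic is easy to follow; they are assumptions of this sketch, not parameters from the patent.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(w_i: List[float], r_sim: List[float], pe_sim: List[float],
             weights: List[List[float]], bias: List[float],
             labels: List[str]) -> Tuple[str, List[float]]:
    """Concatenate the three vectors, apply one linear layer, then softmax."""
    x = w_i + r_sim + pe_sim  # final vector representation for one word
    scores = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


labels = ["O", "A0", "A1"]   # hypothetical label set
w_i = [0.2, -0.1]            # feature code from the encoder
r_sim = [0.0, 1.0, 0.0]      # template role vector
pe_sim = [1.0, 0.0]          # template position vector
weights = [                  # one row per label, one column per input entry
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
]
bias = [0.0, 0.0, 0.0]
label, probs = classify(w_i, r_sim, pe_sim, weights, bias, labels)
```

Here the second row of the weight matrix responds to the active entry of the template role vector, so the word receives the label "A0"; in a trained network these weights would be learned from annotated legal texts.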

Preferably, the word vector model provided in the present application includes:

word2vec language model, glove language model, or BERT language model.

As shown in FIG. 2, the present application provides a legal semantic annotation device based on an LSTM network, including:

a preprocessing module 201, configured to obtain a text and preprocess the text to obtain a text to be analyzed;

the first processing module 202 is configured to perform analysis processing on a text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, convert all the words into D-dimensional word vectors by using a word vector model, and input all the D-dimensional word vectors into a fully connected neural network to obtain feature codes of all the words;

the second processing module 203 is configured to compare the part-of-speech tag of the text to be analyzed with the part-of-speech tag of the text in the preset database to obtain a most-matched text in the preset database, and vectorize the semantic role tag of the most-matched text and the position information corresponding to the semantic role tag to obtain a feature vector;

an obtaining module 204, configured to compound the feature code with the feature vector, and obtain a final vector representation;

and the output module 205 is used for inputting the final vector representation into the fully-connected neural network and outputting semantic role labels of each word in the text to be analyzed.

The working principle of the legal semantic annotation device based on the LSTM network is that a preprocessing module 201 acquires a text and preprocesses the text to acquire the text to be analyzed; the first processing module 202 analyzes and processes the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converts all the words into D-dimension word vectors by adopting a word vector model, and inputs all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words; the second processing module 203 compares the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the text in the preset database to obtain the most matched text in the preset database, and vectorizes the semantic role labels of the most matched text and the position information corresponding to the semantic role labels to obtain feature vectors; the obtaining module 204 combines the feature codes with the feature vectors to obtain a final vector representation; the output module 205 inputs the final vector representation into the fully connected neural network and outputs semantic role labels for each word in the text to be analyzed.

In summary, the invention provides a method and a device for legal semantic annotation based on an LSTM network. The method comprises: acquiring a text and preprocessing the text to acquire a text to be analyzed; analyzing the text to be analyzed to obtain all of its words and the part-of-speech tags corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words; comparing the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in a preset database to obtain a best-matching text in the preset database, and vectorizing the semantic role labels of the best-matching text and the position information corresponding to the semantic role labels to obtain feature vectors; combining the feature codes with the feature vectors to obtain a final vector representation; and inputting the final vector representation into a fully-connected neural network and outputting a semantic role label for each word in the text to be analyzed. The method and the device can automatically identify elements such as the agent, the patient, the time and the place in legal provisions, can help the personnel concerned understand the semantics of legal provisions, provide support for higher-level legal informatization applications, and can effectively improve the working efficiency of staff.

It can be understood that the above-provided device embodiments correspond to the above-described method embodiments, and corresponding specific details may be referred to each other, which is not described herein again.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A legal semantic annotation method based on an LSTM network, characterized by comprising the following steps:

acquiring a text and preprocessing the text to acquire a text to be analyzed;

2. The method of claim 1, wherein the obtaining text and preprocessing the text to obtain text to be analyzed comprises:

3. The method of claim 2, wherein the center predicate includes:

administrative subject, administrative relative person, time, place.

4. The method according to claim 1, wherein the analyzing the text to be analyzed to obtain all words of the text to be analyzed and the part-of-speech tags corresponding to the words includes:

5. The method of claim 1, wherein inputting all of the D-dimensional word vectors into a fully connected neural network obtains feature encodings of all of the words, comprising:

6. The method according to claim 2, wherein comparing the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in a predetermined database to obtain the best matching text in the predetermined database comprises:

7. The method of claim 6, wherein vectorizing the semantic role labels of the best-matching text and the position information corresponding to the semantic role labels to obtain feature vectors comprises:

8. The method of claim 1, wherein inputting the final vector representation into a fully connected neural network, outputting semantic role labels for each word in the text to be analyzed, comprises:

9. The method of any one of claims 1 to 8, wherein the word vector model comprises:

word2vec language model, glove language model, or BERT language model.

10. A legal semantic annotation device based on an LSTM network, characterized by comprising: