CN111460834B - Legal provision semantic annotation method and device based on LSTM network - Google Patents


Publication number
CN111460834B
Authority
CN
China
Prior art keywords
text
analyzed
words
labels
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010273691.8A
Other languages
Chinese (zh)
Other versions
CN111460834A (en
Inventor
莫同
李雨萌
骆旭辉
刘亚亭
张艺璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Peking University Software Engineering Co ltd
Original Assignee
Beijing Peking University Software Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Peking University Software Engineering Co ltd filed Critical Beijing Peking University Software Engineering Co ltd
Priority to CN202010273691.8A priority Critical patent/CN111460834B/en
Publication of CN111460834A publication Critical patent/CN111460834A/en
Application granted granted Critical
Publication of CN111460834B publication Critical patent/CN111460834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and a device for legal semantic annotation based on an LSTM network. The method comprises: obtaining a text to be analyzed; analyzing the text to be analyzed to obtain all of its words and the part-of-speech labels corresponding to the words, converting the words into D-dimensional word vectors, and inputting the D-dimensional word vectors into a fully connected neural network to obtain feature codes; comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain a best-matching text, and obtaining a final vector representation; and inputting the final vector representation into a fully connected neural network to output a semantic role label for each word in the text to be analyzed. The invention can automatically identify elements such as the agent, the patient, the time and the place in legal provisions, can help the relevant personnel understand the semantics of the provisions, provides support for higher-level legal informatization applications, and can effectively improve staff working efficiency.

Description

Legal provision semantic annotation method and device based on LSTM network
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method and a device for semantic annotation of legal provisions based on an LSTM network.
Background
Existing shallow semantic analysis methods, such as semantic role labeling, mostly rely on some degree of syntactic analysis or on manually extracted features. Syntactic analysis has a certain error rate, and those errors propagate into the subsequent semantic analysis. The semantic role labeling task in natural language processing presents many technical difficulties. With the rapid development of deep learning in recent years, semantic role labeling for English and Chinese has improved greatly and achieves good results on data sets from multiple language domains.
However, the growing number of cases and laws in the judicial field puts great pressure on people engaged in legal work. Even professional lawyers can hardly be familiar with every legal provision, and retrieving case-related content from massive legal texts takes a great deal of time and effort, so working efficiency is low. Assisting the relevant practitioners through artificial intelligence has therefore become a pressing need.
Disclosure of Invention
In view of the above, the invention aims to overcome the shortcomings of the prior art by providing a legal semantic annotation method and device based on an LSTM network, so as to solve the prior-art problem that acquiring case-related content from massive legal texts consumes a great deal of time and effort and is inefficient.
In order to achieve the above purpose, the invention adopts the following technical scheme: a legal semantic annotation method based on an LSTM network comprises the following steps:
acquiring a text and preprocessing the text to acquire a text to be analyzed;
analyzing and processing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;
comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain a best-matching text in the preset database, and vectorizing semantic role labels of the best-matching text and position information corresponding to the semantic role labels to obtain feature vectors;
compounding the feature codes with the feature vectors to obtain final vector representations;
and inputting the final vector representation into a fully-connected neural network, and outputting semantic role labels of each word in the text to be analyzed.
Further, the obtaining text and preprocessing the text to obtain the text to be analyzed includes:
normalizing the text to obtain a text to be analyzed in a standard data input form; the text to be analyzed in the standard data input form is a text in which the central predicate has been designated.
Further, the central predicate includes:
administrative subject, administrative relative person, time, place.
Further, the analyzing the text to be analyzed to obtain all words of the text to be analyzed and the part-of-speech tags corresponding to the words includes:
splitting the text to be analyzed according to a legal dictionary by adopting a Chinese word segmentation tool and a part-of-speech tagging tool;
and acquiring all words of the text to be analyzed and the part-of-speech tags corresponding to the words.
Further, inputting all the D-dimensional word vectors into a fully connected neural network to obtain feature codes of all the words, including:
all D-dimensional word vectors are sequentially input into a fully connected neural network; the fully connected neural network is provided with a feature encoder, and the feature encoder comprises a 4-layer stacked bidirectional LSTM, namely a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM;
the first layer LSTM takes the D-dimensional word vectors as input for encoding, the input of each subsequent layer is the output of the layer before it, and the fourth layer LSTM outputs the feature codes.
Further, the comparing the part of speech tag of the text to be analyzed with the part of speech tag of the text in a preset database to obtain the best matching text in the preset database includes:
taking the central predicate as the center, matching the part-of-speech tag string of the text to be analyzed against the part-of-speech tag string of each text in the preset database, extending outward on both sides;
and calculating the matching degree according to the length of the matched string, and obtaining the best-matching text.
Further, vectorizing the semantic role labels of the best-matching text and the position information corresponding to the semantic role labels to obtain feature vectors includes:
vectorizing the semantic role labels of the most matched text to obtain a first vector representation;
vectorizing the distance between the semantic role labels and the central predicates to obtain a second vector representation;
the first vector representation and the second vector representation are composited into a feature vector.
Further, inputting the final vector representation into a fully connected neural network and outputting semantic role labels of each word in the text to be analyzed includes:
inputting the final vector into a fully connected neural network, wherein a softmax layer is arranged in the fully connected neural network, the softmax layer adopts a softmax classifier to carry out semantic role labeling on each word, and the softmax layer outputs the semantic role labeling.
Further, the word vector model includes:
word2vec language model, GloVe language model, or BERT language model.
An embodiment of the application provides a legal semantic annotation device based on an LSTM network, which comprises:
the preprocessing module is used for acquiring a text and preprocessing the text to acquire a text to be analyzed;
the first processing module is used for analyzing and processing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;
the second processing module is used for comparing the part-of-speech labels of the texts to be analyzed with the part-of-speech labels of the texts in a preset database to obtain the most matched texts in the preset database, and vectorizing semantic role labels of the most matched texts and position information corresponding to the semantic role labels to obtain feature vectors;
the acquisition module is used for compositing the feature codes with the feature vectors to acquire final vector representations;
and the output module is used for inputting the final vector representation into a fully-connected neural network and outputting semantic role labels of each word in the text to be analyzed.
By adopting the technical scheme, the invention has the following beneficial effects:
According to the method, a legal provision text is first vectorized and its part-of-speech tagging result is predicted; next, the most similar legal provision in the database is found based on the part-of-speech tagging result, and a vector representation of that provision's semantic role labels is obtained; finally, the data are input into an LSTM network to obtain the semantic role label of each word. Deep learning is applied in the vectorization process, which gives the method a degree of extensibility.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the steps of the legal semantic annotation method based on an LSTM network;
FIG. 2 is a schematic structural diagram of the legal semantic annotation device based on an LSTM network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of protection of the invention.
A specific legal semantic annotation method based on an LSTM network provided in the embodiments of the present application is described below with reference to the accompanying drawings.
As shown in FIG. 1, the legal semantic annotation method based on an LSTM network provided in the embodiment of the present application includes:
S101, acquiring a text and preprocessing the text to acquire a text to be analyzed;
The method is mainly intended for staff who consult legal provisions. A legal text is first obtained and preprocessed. Preprocessing normalizes the text into a standard data input form: for each input text, the central predicate in the text is designated, and the text with its designated central predicate is the text to be analyzed.
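The standard data input form described in S101 can be pictured as a small record that pairs the text with its designated central predicate. The following is a minimal sketch only: the class name `AnalysisInput`, the function `normalize`, the whitespace-stripping rule and the sample sentence are all assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisInput:
    """Hypothetical normalized input: a provision plus its central predicate."""
    text: str
    predicate: str
    predicate_start: int  # character offset of the predicate within text


def normalize(raw: str, predicate: str) -> AnalysisInput:
    """Strip all whitespace (an assumed rule) and locate the designated predicate."""
    text = "".join(raw.split())
    start = text.find(predicate)
    if start < 0:
        raise ValueError(f"predicate {predicate!r} not found in text")
    return AnalysisInput(text=text, predicate=predicate, predicate_start=start)


# Invented sample sentence; "作出" is designated as the central predicate.
sample = normalize(" 行政机关 应当 在七日内 作出 决定 ", "作出")
print(sample)
```

Rejecting a text whose designated predicate cannot be found keeps every later step free to assume that a predicate position exists.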
S102, analyzing and processing a text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;
The text to be analyzed is split into a plurality of words, and a part-of-speech label is produced for each word. Each word is then vectorized with a word vector model, that is, converted into a D-dimensional word vector, and the D-dimensional word vectors are input into the fully connected neural network for training to obtain the feature code of each word. A D-dimensional word vector is a vector of dimension D that represents a Chinese word.
The present application uses an existing word vector model and imposes no special requirements on it; the parameters of the word vector model can be adjusted so that word recognition is more accurate and faster.
S103, comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain a best-matching text in the preset database, and vectorizing semantic role labels of the best-matching text and position information corresponding to the semantic role labels to obtain feature vectors;
A database is preset, containing legal texts that have been tagged with parts of speech. The part-of-speech tags of the given text to be analyzed are compared with the part-of-speech tags of the legal texts in the database, and the legal text with the highest matching degree is taken as the best-matching text. The semantic role labels in the best-matching text and the position information corresponding to those labels are then vectorized to obtain a feature vector.
S104, compositing the feature codes and the feature vectors to obtain a final vector representation;
The feature code of each word obtained in step S102 is concatenated with the feature vector of the corresponding word in the best-matching text obtained in step S103 to obtain the final vector representation.
S105, inputting the final vector representation into a fully-connected neural network, and outputting semantic role labels of each word in the text to be analyzed.
The final vector representation is input into a fully connected neural network, classified by the softmax classifier in that network, and the semantic role label of each word in the text to be analyzed is output.
The working principle of the legal semantic annotation method based on an LSTM network is as follows: first, a legal text is vectorized and its part-of-speech tagging result is predicted; second, the most similar legal text in the database is found based on the part-of-speech tagging result, yielding a vector representation of that text's semantic role labels; finally, the data are input into the LSTM network to obtain the semantic role label of each word. Deep learning is applied in the vectorization process, which gives the method a degree of extensibility. The invention can automatically identify elements such as the agent, the patient, the time and the place in legal provisions, can help the relevant personnel understand the semantics of the provisions, provides support for higher-level legal informatization applications, and can effectively improve staff working efficiency.
In some embodiments, obtaining text and preprocessing the text to obtain text to be analyzed includes:
normalizing the text to obtain a text to be analyzed in a standard data input form; the text to be analyzed in the standard data input form is a text in which the central predicate has been designated.
Preferably, the central predicate includes:
administrative subject, administrative relative person, time, place.
In some embodiments, analyzing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech tags corresponding to the words, including:
splitting the text to be analyzed according to a legal dictionary by adopting a Chinese word segmentation tool and a part-of-speech tagging tool;
and acquiring the part-of-speech tags corresponding to all words of the text to be analyzed.
Specifically, a Chinese word segmentation tool is used to segment the legal provision text and obtain all words in the text, and a part-of-speech tagging tool is used to tag all words obtained from the legal provision text, giving the part-of-speech tag corresponding to each word. The text to be analyzed is split according to the legal dictionary, so that the relevant words in the legal dictionary are obtained, for example: administrative subject, administrative relative person, time, place, etc.
It should be noted that the Chinese word segmentation tool and the part-of-speech tagging tool used in the present application are both prior art and are not described in detail here.
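To make dictionary-guided splitting concrete, the sketch below uses forward maximum matching against a tiny dictionary that also stores a part-of-speech tag per word. This is a toy stand-in, not the segmentation or tagging tool the application relies on (which it does not name); the dictionary entries, the tag letters and the fallback tag `x` are hypothetical.

```python
# Hypothetical mini legal dictionary: word -> part-of-speech tag.
LEGAL_DICT = {
    "行政机关": "n",
    "应当": "v",
    "在": "p",
    "七日": "t",
    "内": "f",
    "作出": "v",
    "决定": "n",
}
MAX_WORD_LEN = max(len(word) for word in LEGAL_DICT)


def segment_and_tag(text):
    """Forward maximum matching; characters not in the dictionary get tag 'x'."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in LEGAL_DICT:
                result.append((candidate, LEGAL_DICT[candidate]))
                i += size
                break
        else:
            # No dictionary word starts here: emit the single character.
            result.append((text[i], "x"))
            i += 1
    return result


tokens = segment_and_tag("行政机关应当在七日内作出决定")
for word, tag in tokens:
    print(word, tag)
```

The resulting tag sequence is what the later template-matching step compares against the database.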
In some embodiments, inputting all D-dimensional word vectors into the fully connected neural network obtains feature encodings of all words, including:
all D-dimensional word vectors are sequentially input into a fully connected neural network; the fully connected neural network is provided with a feature encoder, and the feature encoder comprises a 4-layer stacked bidirectional LSTM, namely a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM;
the first layer LSTM takes the D-dimensional word vectors as input for encoding, the input of each subsequent layer is the output of the layer before it, and the fourth layer LSTM outputs the feature codes.
Specifically, all D-dimensional word vectors are sequentially input into a feature encoder built from bidirectional LSTMs. The feature encoder consists of 4 stacked bidirectional LSTM layers: a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM. The first layer LSTM takes the D-dimensional word vectors as input for encoding, the second layer LSTM takes the output of the first layer LSTM as input, and likewise the input of each subsequent layer is the output of the layer before it; finally, the fourth layer LSTM outputs the feature code W_i. To mitigate the vanishing-gradient problem of the multilayer LSTM structure, a highway LSTM structure is introduced in the present application.
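The layer-by-layer data flow of the encoder can be sketched in plain Python as follows. This is a minimal, untrained illustration under several assumptions: weights are random, the hidden size and input size are arbitrary, each layer concatenates its forward and backward outputs, and the highway connections mentioned above are omitted for brevity. The class and function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Single LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                      for _ in range(hidden_size)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def _gate(self, name, z, act):
        return [act(sum(wi * zi for wi, zi in zip(row, z)) + b)
                for row, b in zip(self.w[name], self.b[name])]

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        o = self._gate("o", z, sigmoid)
        g = self._gate("c", z, math.tanh)
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run_direction(cell, xs, reverse=False):
    """Run one cell over the sequence; outputs are returned in input order."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in (reversed(xs) if reverse else xs):
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs[::-1] if reverse else outputs


class StackedBiLSTM:
    def __init__(self, input_size, hidden_size, num_layers=4, seed=0):
        rng = random.Random(seed)
        self.layers = []
        size = input_size
        for _ in range(num_layers):
            self.layers.append((LSTMCell(size, hidden_size, rng),
                                LSTMCell(size, hidden_size, rng)))
            size = 2 * hidden_size  # the next layer sees both directions

    def encode(self, xs):
        for fwd, bwd in self.layers:
            f_out = run_direction(fwd, xs)
            b_out = run_direction(bwd, xs, reverse=True)
            xs = [f + b for f, b in zip(f_out, b_out)]  # input to the next layer
        return xs  # output of the fourth layer: one feature code per word


encoder = StackedBiLSTM(input_size=8, hidden_size=5)
word_vectors = [[0.1 * k] * 8 for k in range(1, 4)]  # three placeholder 8-dim vectors
codes = encoder.encode(word_vectors)
print(len(codes), len(codes[0]))
```

In practice a deep learning framework would supply the stacked bidirectional LSTM and the training loop; the hand-written version only shows how the output of one layer becomes the input of the next.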
In some embodiments, comparing the part of speech tag of the text to be analyzed with the part of speech tag of the text in the preset database to obtain the best matching text in the preset database includes:
taking the central predicate as the center, matching the part-of-speech tag string of the text to be analyzed against the part-of-speech tag string of each text in the preset database, extending outward on both sides;
and calculating the matching degree according to the matching length of the character strings, and obtaining the best matching text.
Preferably, vectorizing the semantic role labels of the best-matching text and the position information corresponding to the semantic role labels to obtain feature vectors includes:
vectorizing semantic role labels of the most matched text to obtain a first vector representation;
vectorizing the distance between the semantic role labels and the central predicates to obtain a second vector representation;
the first vector representation and the second vector representation are combined into a feature vector.
Specifically, based on the part-of-speech tags of the given legal provision text, the labeled data set is searched for the legal provision text whose part-of-speech tags are most similar, and that text is used as a template. The similarity between the text to be analyzed S and each legal provision text S_i in the database D is calculated. In particular, a longest-string-matching method is used: taking the central predicate V as the center, the part-of-speech tag strings are matched outward on both sides to obtain the longest matched string length L_i, and the best-matching text S_sim is
$$S_{sim} = \arg\max_{S_i \in D} L_i$$
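One possible reading of this matching rule is sketched below: the two tag sequences are aligned at their central predicates, the match is grown to the left and to the right while the tags agree, and the database entry with the largest matched length is selected. The exact alignment and tie-breaking rules are not stated in the text, so the details here (contiguous matching, first maximum wins, the sample tag sequences) are assumptions.

```python
def match_length(query_tags, query_pred, cand_tags, cand_pred):
    """Count contiguous matching tags, growing outward from the two predicates."""
    if query_tags[query_pred] != cand_tags[cand_pred]:
        return 0
    length = 1
    # Extend to the left of the predicate.
    q, c = query_pred - 1, cand_pred - 1
    while q >= 0 and c >= 0 and query_tags[q] == cand_tags[c]:
        length += 1
        q -= 1
        c -= 1
    # Extend to the right of the predicate.
    q, c = query_pred + 1, cand_pred + 1
    while q < len(query_tags) and c < len(cand_tags) and query_tags[q] == cand_tags[c]:
        length += 1
        q += 1
        c += 1
    return length


def best_match(query_tags, query_pred, database):
    """database: list of (tags, predicate_index); returns (argmax index, all lengths)."""
    lengths = [match_length(query_tags, query_pred, tags, pred)
               for tags, pred in database]
    best = max(range(len(database)), key=lengths.__getitem__)
    return best, lengths


# Invented part-of-speech sequences; the predicate index points at a "v" tag.
query = ["n", "v", "p", "t", "f", "v", "n"]
database = [
    (["n", "d", "v", "n"], 2),
    (["n", "v", "p", "t", "f", "v", "n", "u"], 5),
    (["r", "p", "t", "f", "v", "n"], 4),
]
best, lengths = best_match(query, 5, database)
print(best, lengths)
```

Here the second entry shares the whole tag sequence around the predicate, so it is chosen as the template.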
The semantic role labeling result of the best-matching text S_sim is vectorized to obtain the vector representation of the best-matching text. Specifically, the semantic role labels in the best-matching text are vectorized to obtain a dim1-dimensional vector representation R_sim. At the same time, the relative distance between each semantic role and the central predicate is encoded to obtain a dim2-dimensional vector representation PE_sim. The vectors R_sim and PE_sim are concatenated with the feature codes of the text to be analyzed to obtain the final vector representation.
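A minimal sketch of this vectorization and concatenation follows. The text does not say how the role labels or the distances are encoded, so one-hot vectors are used here purely as an assumption; the label set `ROLES`, the clipping bound `MAX_DIST`, and the word-by-word alignment between the template and the text to be analyzed are likewise hypothetical.

```python
ROLES = ["O", "A0", "A1", "TMP", "LOC", "V"]  # hypothetical label set, dim1 = 6
MAX_DIST = 3                                   # distances clipped to [-3, 3], dim2 = 7


def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def template_features(roles, pred_index):
    """Return the role vectors and the distance vectors of the template words."""
    r_sim = [one_hot(ROLES.index(role), len(ROLES)) for role in roles]
    pe_sim = []
    for pos in range(len(roles)):
        # Signed distance to the central predicate, clipped to a fixed range.
        dist = max(-MAX_DIST, min(MAX_DIST, pos - pred_index))
        pe_sim.append(one_hot(dist + MAX_DIST, 2 * MAX_DIST + 1))
    return r_sim, pe_sim


def compose(w, r_sim, pe_sim):
    """Concatenate feature code, role vector and distance vector word by word."""
    return [wi + ri + pi for wi, ri, pi in zip(w, r_sim, pe_sim)]


template_roles = ["A0", "O", "TMP", "V", "A1"]       # invented template labels
r_sim, pe_sim = template_features(template_roles, pred_index=3)
feature_codes = [[0.5, -0.5] for _ in template_roles]  # placeholder encoder output
final = compose(feature_codes, r_sim, pe_sim)
print(len(final), len(final[0]))
```

Learned embedding tables could replace the one-hot vectors without changing the concatenation step.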
In some embodiments, inputting the final vector representation into the fully connected neural network, outputting semantic role labels for each word in the text to be analyzed, comprising:
the final vector is input into a fully connected neural network, a softmax layer is arranged in the fully connected neural network, the softmax layer adopts a softmax classifier to carry out semantic role labeling on each word, and the softmax layer outputs the semantic role labeling.
Specifically, the output W_i of the last bidirectional LSTM layer is concatenated with the vectors R_sim and PE_sim obtained in step S103 to give the final vector representation [W_i; R_sim; PE_sim], which is input into a fully connected neural network and then passed through a softmax layer to obtain a multi-class classification result. The output of the softmax layer is the semantic role label of each word in the text to be analyzed relative to the given predicate.
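The final classification step can be sketched as a single fully connected layer followed by softmax, with the most probable class taken as the label of each word. The weights below are random and untrained, and the label set and vector size are hypothetical, so the printed labels carry no meaning; the sketch only shows the shape of the computation.

```python
import math
import random

LABELS = ["O", "A0", "A1", "TMP", "LOC"]  # hypothetical semantic role label set


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(final_vectors, weights, bias):
    """Apply one fully connected layer plus softmax to every word vector."""
    labels = []
    for vec in final_vectors:
        scores = [sum(w * v for w, v in zip(row, vec)) + b
                  for row, b in zip(weights, bias)]
        probs = softmax(scores)
        labels.append(LABELS[probs.index(max(probs))])
    return labels


rng = random.Random(0)
dim = 15  # assumed size of each concatenated final vector
weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in LABELS]
bias = [0.0] * len(LABELS)
final_vectors = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(4)]
predicted = classify(final_vectors, weights, bias)
print(predicted)
```

After training, the same forward pass would yield one semantic role label per word relative to the designated predicate.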
Preferably, the word vector model provided in the present application includes:
word2vec language model, GloVe language model, or BERT language model.
As shown in FIG. 2, the present application provides a legal semantic annotation device based on an LSTM network, including:
a preprocessing module 201, configured to obtain a text and preprocess the text to obtain a text to be analyzed;
the first processing module 202 is configured to perform analysis processing on a text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, convert all the words into D-dimensional word vectors by using a word vector model, and input all the D-dimensional word vectors into a fully connected neural network to obtain feature codes of all the words;
the second processing module 203 is configured to compare the part-of-speech tag of the text to be analyzed with the part-of-speech tag of the text in the preset database to obtain a most-matched text in the preset database, and vectorize the semantic role tag of the most-matched text and the position information corresponding to the semantic role tag to obtain a feature vector;
an obtaining module 204, configured to compound the feature code with the feature vector, and obtain a final vector representation;
and the output module 205 is used for inputting the final vector representation into the fully-connected neural network and outputting semantic role labels of each word in the text to be analyzed.
The working principle of the legal semantic annotation device based on the LSTM network is that a preprocessing module 201 acquires a text and preprocesses the text to acquire the text to be analyzed; the first processing module 202 analyzes and processes the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converts all the words into D-dimension word vectors by adopting a word vector model, and inputs all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words; the second processing module 203 compares the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the text in the preset database to obtain the most matched text in the preset database, and vectorizes the semantic role labels of the most matched text and the position information corresponding to the semantic role labels to obtain feature vectors; the obtaining module 204 combines the feature codes with the feature vectors to obtain a final vector representation; the output module 205 inputs the final vector representation into the fully connected neural network and outputs semantic role labels for each word in the text to be analyzed.
In summary, the invention provides a method and a device for legal semantic annotation based on an LSTM network, comprising: obtaining a text and preprocessing it to obtain the text to be analyzed; analyzing the text to be analyzed to obtain all of its words and the part-of-speech labels corresponding to the words, converting all the words into D-dimensional word vectors with a word vector model, and inputting all the D-dimensional word vectors into a fully connected neural network to obtain the feature codes of all the words; comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain the best-matching text in the preset database, and vectorizing the semantic role labels of the best-matching text and the position information corresponding to those labels to obtain feature vectors; combining the feature codes with the feature vectors to obtain the final vector representation; and inputting the final vector representation into a fully connected neural network to output the semantic role label of each word in the text to be analyzed. The method and device can automatically identify elements such as the agent, the patient, the time and the place in legal provisions, can help the relevant personnel understand the semantics of the provisions, provide support for higher-level legal informatization applications, and can effectively improve staff working efficiency.
It can be understood that the above-provided device embodiments correspond to the above-described method embodiments, and corresponding specific details may be referred to each other, which is not described herein again.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely illustrative of the present invention, and the scope of protection of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of falls within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. A legal semantic annotation method based on an LSTM network, characterized by comprising the following steps:
acquiring a text and preprocessing the text to acquire a text to be analyzed;
analyzing and processing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;
comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain a best-matching text in the preset database, and vectorizing semantic role labels of the best-matching text and position information corresponding to the semantic role labels to obtain feature vectors;
compounding the feature codes with the feature vectors to obtain final vector representations;
and inputting the final vector representation into a fully-connected neural network, and outputting semantic role labels of each word in the text to be analyzed.
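As an illustration of the compounding step in claim 1, the following is a minimal sketch that assumes "compounding" means per-word concatenation and that one feature vector is available for each word. The claim fixes neither point, and the function name `compound` is hypothetical.

```python
from typing import List

Vector = List[float]


def compound(feature_codes: List[Vector], feature_vectors: List[Vector]) -> List[Vector]:
    """Concatenate each word's feature code with its feature vector.

    Concatenation is an assumption; the claim only states that the two
    are compounded into a final vector representation.
    """
    if len(feature_codes) != len(feature_vectors):
        raise ValueError("one feature vector is expected per word")
    return [code + vec for code, vec in zip(feature_codes, feature_vectors)]


# Two words, each with a 3-dimensional feature code and a 2-dimensional feature vector.
codes = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
vectors = [[1.0, 0.0], [0.0, 1.0]]
final = compound(codes, vectors)
```

Under these assumptions, each final vector has as many dimensions as the feature code and the feature vector together (5 in this toy case).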
2. The method of claim 1, wherein acquiring a text and preprocessing the text to obtain a text to be analyzed comprises:
normalizing the text to obtain a text to be analyzed in a standard data input form, wherein the text to be analyzed in the standard data input form is a text in which the central predicate is specified.
3. The method of claim 2, wherein the central predicate includes:
administrative subject, administrative counterpart, time, place.
4. The method according to claim 1, wherein analyzing the text to be analyzed to obtain all words of the text to be analyzed and the part-of-speech tags corresponding to the words comprises:
splitting the text to be analyzed according to a legal dictionary by using a Chinese word segmentation tool and a part-of-speech tagging tool;
and acquiring all words of the text to be analyzed and the part-of-speech tags corresponding to the words.
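The dictionary-guided splitting in claim 4 can be pictured with a small stand-in. The claim does not name a specific segmentation or tagging tool, so the sketch below substitutes a plain forward-maximum-matching segmenter and a lookup table for part-of-speech tags. The dictionary entries, the tag names, and the function `forward_max_match` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary
        # contains it; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def tag_words(words: List[str], pos_lookup: Dict[str, str]) -> List[Tuple[str, str]]:
    """Attach a part-of-speech tag to each word; unknown words get 'x'."""
    return [(w, pos_lookup.get(w, "x")) for w in words]


# Hypothetical legal dictionary with part-of-speech tags (n = noun, v = verb).
pos_lookup = {"行政机关": "n", "行政": "n", "处罚": "v", "当事人": "n"}
tokens = forward_max_match("行政机关处罚当事人", set(pos_lookup))
tagged = tag_words(tokens, pos_lookup)
```

Because the longest entry wins, "行政机关" is kept whole rather than being split at the shorter entry "行政", which is the effect a domain dictionary is meant to have.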
5. The method of claim 1, wherein inputting all of the D-dimensional word vectors into a fully connected neural network to obtain feature encodings of all of the words comprises:
inputting all D-dimensional word vectors sequentially into the fully connected neural network, wherein the fully connected neural network is provided with a feature encoder, and the feature encoder comprises a 4-layer stacked bidirectional LSTM consisting of a first-layer LSTM, a second-layer LSTM, a third-layer LSTM, and a fourth-layer LSTM;
the first-layer LSTM takes the D-dimensional word vectors as input for encoding, the input of each subsequent LSTM layer is the output of the preceding layer, and the fourth-layer LSTM outputs the feature encoding.
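The layer stacking described in claim 5 can be sketched in plain Python. This is an untrained toy with random weights, not the patented encoder. The hidden size, the weight initialisation, and the choice to concatenate the forward and backward outputs at every layer are assumptions, and all class and function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Single-direction LSTM with randomly initialised (untrained) weights."""

    def __init__(self, input_size: int, hidden_size: int, rng: random.Random):
        self.n = hidden_size
        # One weight row per gate unit; gate order: input, forget, candidate, output.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
            for _ in range(4 * hidden_size)
        ]
        self.bias = [0.0] * (4 * hidden_size)

    def run(self, sequence: List[Vector]) -> List[Vector]:
        n = self.n
        h, c = [0.0] * n, [0.0] * n
        outputs = []
        for x in sequence:
            joined = x + h  # current input followed by previous hidden state
            pre = [sum(w * v for w, v in zip(row, joined)) + b
                   for row, b in zip(self.weights, self.bias)]
            i_gate = [sigmoid(p) for p in pre[0:n]]
            f_gate = [sigmoid(p) for p in pre[n:2 * n]]
            g_cand = [math.tanh(p) for p in pre[2 * n:3 * n]]
            o_gate = [sigmoid(p) for p in pre[3 * n:4 * n]]
            c = [f * cp + i * g for f, cp, i, g in zip(f_gate, c, i_gate, g_cand)]
            h = [o * math.tanh(cv) for o, cv in zip(o_gate, c)]
            outputs.append(h)
        return outputs


class BiLSTMLayer:
    """Runs one LSTM left-to-right and one right-to-left, then concatenates."""

    def __init__(self, input_size: int, hidden_size: int, rng: random.Random):
        self.fwd = LSTMCell(input_size, hidden_size, rng)
        self.bwd = LSTMCell(input_size, hidden_size, rng)

    def run(self, sequence: List[Vector]) -> List[Vector]:
        forward = self.fwd.run(sequence)
        backward = self.bwd.run(sequence[::-1])[::-1]
        return [f + b for f, b in zip(forward, backward)]


def build_encoder(d: int, hidden: int, num_layers: int = 4, seed: int = 0) -> List[BiLSTMLayer]:
    rng = random.Random(seed)
    layers, size = [], d
    for _ in range(num_layers):
        layers.append(BiLSTMLayer(size, hidden, rng))
        size = 2 * hidden  # the next layer consumes the concatenated output
    return layers


def encode(layers: List[BiLSTMLayer], word_vectors: List[Vector]) -> List[Vector]:
    sequence = word_vectors
    for layer in layers:  # each layer's input is the previous layer's output
        sequence = layer.run(sequence)
    return sequence


encoder = build_encoder(d=5, hidden=3)
sentence = [[0.1 * (i + j) for j in range(5)] for i in range(6)]  # 6 words, D = 5
feature_codes = encode(encoder, sentence)
```

The encoder returns one feature code per input word. In this sketch its width is twice the hidden size because the two directions are concatenated.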
6. The method according to claim 2, wherein comparing the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in a preset database to obtain the best-matching text in the preset database comprises:
taking the central predicate as the center, performing string matching outward on both sides between the part-of-speech tags of the text to be analyzed and the part-of-speech tags of each text in the preset database;
and calculating the matching degree according to the matched string length, and obtaining the best-matching text.
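One way to read the matching in claim 6 is sketched below. It assumes that the part-of-speech tags are held as lists, that the predicate position is known for every text, and that the matching degree is simply the number of consecutive tags that agree on each side of the predicate. The claim does not give an exact formula, so this scoring and the names `match_length` and `best_match` are hypothetical.

```python
from typing import List, Sequence, Tuple


def match_length(tags_a: Sequence[str], pred_a: int,
                 tags_b: Sequence[str], pred_b: int) -> int:
    """Count tags that agree when walking outward from both central predicates."""
    matched = 0
    for step in (-1, 1):  # left side first, then right side
        i, j = pred_a + step, pred_b + step
        while (0 <= i < len(tags_a) and 0 <= j < len(tags_b)
               and tags_a[i] == tags_b[j]):
            matched += 1
            i += step
            j += step
    return matched


def best_match(query_tags: Sequence[str], query_pred: int,
               database: List[Tuple[Sequence[str], int]]) -> Tuple[int, int]:
    """Return (index, score) of the database entry with the longest match."""
    scores = [match_length(query_tags, query_pred, tags, pred)
              for tags, pred in database]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Each entry is (part-of-speech tags, index of the central predicate).
query = ["n", "d", "v", "n", "u"]
database = [
    (["r", "d", "v", "n", "n"], 2),  # one tag agrees on each side
    (["n", "d", "v", "n", "u"], 2),  # agrees on both sides up to the ends
    (["v", "n"], 0),                 # nothing to the left of the predicate
]
index, score = best_match(query, 2, database)
```

The walk stops at the first disagreement on each side, so the score rewards contiguous agreement around the predicate rather than scattered matches. An empty database would make `max` raise a `ValueError`, which a fuller implementation would need to handle.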
7. The method of claim 6, wherein vectorizing the semantic role labels of the best-matching text and the position information corresponding to the semantic role labels to obtain feature vectors comprises:
vectorizing the semantic role labels of the best-matching text to obtain a first vector representation;
vectorizing the distance between the semantic role labels and the central predicate to obtain a second vector representation;
compounding the first vector representation and the second vector representation into a feature vector.
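The two vectorizations in claim 7 can be illustrated with one-hot encodings. The label inventory, the clipping of distances to a fixed window, and the use of concatenation to compound the two vectors are assumptions made for this sketch. The claim does not specify the encoding, and all names below are hypothetical.

```python
from typing import List

Vector = List[float]

# Hypothetical semantic role label inventory.
ROLE_LABELS = ["A0", "A1", "AM-TMP", "AM-LOC", "V", "O"]


def one_hot(label: str, labels: List[str]) -> Vector:
    """First vector representation: one-hot encoding of the role label."""
    vec = [0.0] * len(labels)
    vec[labels.index(label)] = 1.0
    return vec


def distance_vector(position: int, predicate_position: int, max_distance: int = 3) -> Vector:
    """Second vector representation: one-hot of the signed, clipped distance."""
    d = max(-max_distance, min(max_distance, position - predicate_position))
    vec = [0.0] * (2 * max_distance + 1)
    vec[d + max_distance] = 1.0
    return vec


def feature_vectors(roles: List[str], predicate_position: int) -> List[Vector]:
    """Concatenate both representations for every word of the matched text."""
    return [one_hot(role, ROLE_LABELS) + distance_vector(i, predicate_position)
            for i, role in enumerate(roles)]


# Role labels of a best-matching text whose central predicate is at index 2.
roles = ["A0", "AM-TMP", "V", "A1"]
features = feature_vectors(roles, predicate_position=2)
```

With six labels and a window of three positions on each side, every feature vector in this sketch has 6 + 7 = 13 dimensions.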
8. The method of claim 1, wherein inputting the final vector representation into a fully connected neural network and outputting semantic role labels for each word in the text to be analyzed comprises:
inputting the final vector representation into the fully connected neural network, wherein a softmax layer is arranged in the fully connected neural network, the softmax layer uses a softmax classifier to perform semantic role labeling on each word, and the softmax layer outputs the semantic role labels.
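The softmax classification in claim 8 can be sketched as a single dense layer followed by a softmax and an argmax. The weights below are a hand-picked identity matrix chosen only to make the example easy to follow; a real model would learn them. The label set and function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer: one output score per label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def label_words(final_vectors: List[Vector], weights: List[Vector],
                bias: Vector, labels: List[str]) -> List[str]:
    """Assign each word the label with the highest softmax probability."""
    predicted = []
    for vec in final_vectors:
        probs = softmax(dense(vec, weights, bias))
        predicted.append(labels[probs.index(max(probs))])
    return predicted


labels = ["A0", "V", "A1"]  # hypothetical label set
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, for illustration
bias = [0.0, 0.0, 0.0]
final_vectors = [[2.0, 0.1, 0.3], [0.2, 3.0, 0.1], [0.0, 0.5, 1.5]]
predicted = label_words(final_vectors, weights, bias, labels)
```

The softmax turns the layer's scores into a probability distribution over labels for each word, and the label with the largest probability is emitted.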
9. The method of any one of claims 1 to 8, wherein the word vector model comprises:
a word2vec language model, a GloVe language model, or a BERT language model.
10. A legal semantic annotation device based on an LSTM network, characterized by comprising:
the preprocessing module is used for acquiring a text and preprocessing the text to acquire a text to be analyzed;
the first processing module is used for analyzing and processing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimension word vectors by adopting a word vector model, and inputting all the D-dimension word vectors into a fully-connected neural network to obtain feature codes of all the words;
the second processing module is used for comparing the part-of-speech labels of the text to be analyzed with the part-of-speech labels of the texts in a preset database to obtain the best-matching text in the preset database, and vectorizing semantic role labels of the best-matching text and position information corresponding to the semantic role labels to obtain feature vectors;
the acquisition module is used for compounding the feature codes with the feature vectors to acquire final vector representations;
and the output module is used for inputting the final vector representation into a fully-connected neural network and outputting semantic role labels of each word in the text to be analyzed.
CN202010273691.8A 2020-04-09 2020-04-09 Legal semantic annotation method and device based on LSTM network Active CN111460834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010273691.8A CN111460834B (en) 2020-04-09 2020-04-09 Legal semantic annotation method and device based on LSTM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010273691.8A CN111460834B (en) 2020-04-09 2020-04-09 Legal semantic annotation method and device based on LSTM network

Publications (2)

Publication Number Publication Date
CN111460834A CN111460834A (en) 2020-07-28
CN111460834B CN111460834B (en) 2023-06-06

Family

ID=71681233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010273691.8A Active CN111460834B (en) 2020-04-09 2020-04-09 Legal semantic annotation method and device based on LSTM network

Country Status (1)

Country Link
CN (1) CN111460834B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177406B (en) * 2021-04-23 2023-07-07 珠海格力电器股份有限公司 Text processing method, text processing device, electronic equipment and computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014110980A1 (en) * 2013-01-21 2014-07-24 Liu Shugen Ideographical member identification and extraction method and machine-translation and manual-correction interactive translation method based on ideographical members
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on depth learning and distributed semantic features
CN106202010A (en) * 2016-07-12 2016-12-07 重庆兆光科技股份有限公司 The method and apparatus building Law Text syntax tree based on deep neural network
CN109767758A (en) * 2019-01-11 2019-05-17 中山大学 Vehicle-mounted voice analysis method, system, storage medium and equipment
CN110276068A (en) * 2019-05-08 2019-09-24 清华大学 Law merit analysis method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
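Zhu Pengfei. Research on Automatic Chinese Semantic Role Labeling Based on Bi-LSTM. China Master's Theses Full-text Database, Information Science and Technology Series. 2019, full text. *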

Also Published As

Publication number Publication date
CN111460834A (en) 2020-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant