CN116561594A - Legal document similarity analysis method based on Word2vec - Google Patents


Info

Publication number
CN116561594A
Authority
CN
China
Prior art keywords
corpus
similarity
legal
text
word2vec
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310236373.8A
Other languages
Chinese (zh)
Inventor
郑志松
刘晓雷
吴运昌
丁仙峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Shudui Technology Co ltd
Original Assignee
Jiangsu Shudui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Shudui Technology Co ltd filed Critical Jiangsu Shudui Technology Co ltd
Priority to CN202310236373.8A priority Critical patent/CN116561594A/en
Publication of CN116561594A publication Critical patent/CN116561594A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/903 - Querying
    • G06F16/90335 - Query processing
    • G06F16/90344 - Query processing by using string matching techniques
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Word2vec-based legal document similarity analysis method comprising the following steps: defining two criteria of text similarity; creating a legal vocabulary corpus; counting and screening the corpus and the occurrence frequency of its vocabulary; segmenting identical character strings in different texts and calculating the ratio of intersection to union between vocabularies; constructing a training set from a legal text data set and training a word2vec model; enlarging the vector space by taking the union of the original vector space and a micro corpus; evaluating the word2vec model through cosine similarity; and optimizing the word2vec model with a control-variable method. The method combines Word2vec with a legal document corpus, uses this specific corpus to describe the shape of the decision boundary, completes the similarity analysis of legal documents, compensates for the deficiencies of the decision boundary, and improves the accuracy and sensitivity of the model.

Description

Legal document similarity analysis method based on Word2vec
Technical Field
The invention relates to the field of legal document similarity analysis, and in particular to a Word2vec-based legal document similarity analysis method.
Background
Advanced technologies such as big data, data analysis, the Internet of Things, wireless technology and 3D printing have permeated our daily lives, bringing great progress to economic development and convenience to everyday life. Intelligent justice, as an important component of the intelligent society, has achieved remarkable results in intelligent prisons, intelligent notarization, intelligent adjudication and other areas. As the demand for computer-assisted justice grows, so does the demand for intelligent justice, and deep learning has gradually become an effective means of supporting it. In recent years, the success of text similarity has drawn researchers' attention to the application of similarity analysis to legal documents.
Similarity analysis of legal documents is the foundation of intelligent justice. Legal documents for different categories of cases vary greatly in format and length. In the field of text similarity, corpus-based methods overcome basic problems of natural language processing and reach accuracy competitive with human recognition. However, they are easily misled into erroneous predictions, which cannot be tolerated in intelligent judicial applications with strict requirements on judicial fairness. Contextual relations and the meaning inside the text therefore need to be considered.
Traditional text similarity matching methods such as TF-IDF, BM25, Jaccard, SimHash and LDA depend heavily on manual feature engineering; their generalization ability is mediocre, the number of features is limited, and the resulting models are often weak. Such techniques may rely too heavily on the quality of the text representation when calculating similarity, and may also lose underlying textual features such as lexical and syntactic information. Their calculation is a black-box process: the input is a pair of sentences and the output is their similarity value; no sentence vector representation is produced in between, and a single sentence cannot be input, so the method is difficult to adapt to a given application scenario. Such methods are therefore impractical for tasks that require a text vector representation. In addition, traditional text similarity is often limited when processing documents of the same length and format but different content, because it ignores the embedded meaning behind the words. Similarity analysis of legal documents is the foundation of intelligent justice, and legal documents from different cases differ greatly in format and field, which makes similarity analysis difficult. Traditional vector-space similarity analysis usually relies on discrete representations, which fundamentally cannot express embedded meaning nor fix the vector basis. Moreover, as with one-hot representations, a discrete method yields a sparse matrix whose dimensionality becomes very high, creating unacceptable computational cost on massive data. Finally, existing studies have shown that small perturbations of text format and length can easily mislead the predictions of such models. Although term frequency and inverse document frequency are expected to compensate for errors in format and length, they achieve only a relative improvement; in judicial application scenarios with extremely high accuracy requirements, this result remains intolerable.
Disclosure of Invention
Therefore, it is necessary to provide a method that combines Word2vec with a legal document corpus, describes the shape of the decision boundary through the specific legal document corpus, completes the similarity analysis of legal documents, and improves the accuracy and sensitivity of the model by compensating for the deficiencies of the decision boundary.
In order to achieve the above object, the present inventors provide a legal document similarity analysis method based on Word2vec, comprising the following steps:
S101, defining two criteria of text similarity to form a set of similarities between text strings;
S102, creating a legal vocabulary corpus, and calculating the occurrence frequency of its vocabulary through term frequency and inverse document frequency;
S103, constructing a theme according to the characteristics of the application scenario, and counting and screening the corpus and the occurrence frequency of its vocabulary;
S104, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between vocabularies;
S105, constructing a training set from a legal text data set and training a word2vec model;
S106, enlarging the vector space by acquiring the union of the original vector space and a micro corpus;
S107, introducing basic indexes measuring efficiency and accuracy, and evaluating the word2vec model through cosine similarity;
S108, processing the data of S104-S107 with a control-variable method, and optimizing the word2vec model parameters;
S109, projecting the text into a shorter vector through word2vec, so as to map the text to a vector;
S110, performing similarity matching on the segmented character strings, and calculating the similarity of the text over the set of character-string similarities constituting the text.
As a preferred mode of the present invention, in step S101, the similarity measures the difference between similar texts by comparing their exact differences in length or shape.
As a preferred mode of the present invention, in step S102, a legal vocabulary corpus is created and the frequencies of the different vocabularies are calculated; after the word frequencies and the corresponding words are sorted, irrelevant but common words are removed. With α representing all words and γ representing all irrelevant words, the calculation expression of α is:
α = α - γ
As a preferred mode of the present invention, in step S103, the corpus includes a conventional corpus and a target corpus;
the conventional corpus contains more words than belong to the training data, in which case words not shared with the training data are deleted; with K representing all the training data and β representing the corpus, the calculation expression of K is:
K = K ∩ β
the target corpus is used to strengthen the pertinence to cases of the same type.
As a preferred mode of the present invention, step S104 includes: projecting the text into a short vector through a distributed representation, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between words through a one-hot discretization process.
In a preferred mode of the present invention, in step S105, a conventional corpus is selected to train the word2vec model.
In step S106, the vector space is enlarged by obtaining the union of the original vector space and the micro corpus, M represents the vector space, δ represents the micro corpus, and the calculation expression of M is:
M = train(M ∪ δ)
Compared with the prior art, the beneficial effects achieved by the technical solution are as follows:
The method combines Word2vec with a legal vocabulary corpus and establishes a text similarity criterion according to the characteristics of legal scenarios. A legal vocabulary corpus and legal word-frequency statistics are built from the legal data set, and the shape of the decision boundary is described through this specific legal document corpus to complete the similarity analysis of legal documents. In this process, identical character positions of different texts are segmented, and the ratio of intersection to union of the segmented character strings is recorded, which helps optimize word2vec so that it better fits legal scenarios. During model training, enlarging the space vector effectively alleviates the insufficient accuracy of the word2vec model on large-scale data sets; indexes measuring efficiency and accuracy are introduced, and the effect of the model is evaluated through cosine similarity to complete the optimization of the word2vec model. By compensating for the deficiencies of the decision boundary, the accuracy and sensitivity of the word2vec model are effectively improved.
The word2vec model based on the legal vocabulary corpus is shown to be a progressive and careful approach to the problem of legal text similarity; it also points the way toward more accurate techniques and supports the prediction of sentence increases or decreases and of sentence length to assist the adjudication process.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment;
FIG. 2 is a block diagram of a method according to an embodiment;
FIG. 3 is a comparison of word2vec with the bag-of-words (BoW) model according to the embodiment;
FIG. 4 is a comparison of word2vec models on different legal document data sets according to the embodiment.
Detailed Description
In order to describe in detail the technical content, structural features, objects and effects of the technical solution, the following description is given with reference to specific embodiments and the accompanying drawings.
As shown in fig. 1 and 2, the present embodiment provides a legal document similarity analysis method based on Word2vec, which includes the following steps:
S101, defining two criteria of text similarity to form a set of similarities between text strings;
S102, creating a legal vocabulary corpus, and calculating the occurrence frequency of its vocabulary through term frequency and inverse document frequency;
S103, constructing a theme according to the characteristics of the application scenario, and counting and screening the corpus and the occurrence frequency of its vocabulary;
S104, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between vocabularies;
S105, constructing a training set from a legal text data set and training a word2vec model;
S106, enlarging the vector space by acquiring the union of the original vector space and a micro corpus;
S107, introducing basic indexes measuring efficiency and accuracy, and evaluating the word2vec model through cosine similarity;
S108, processing the data of S104-S107 with a control-variable method, and optimizing the word2vec model parameters;
S109, projecting the text into a shorter vector through word2vec, so as to map the text to a vector;
S110, performing similarity matching on the segmented character strings, and calculating the similarity of the text over the set of character-string similarities constituting the text.
In the implementation process of the above embodiment:
In step S101, the similarity criterion is clarified: the difference between similar sentences is measured by comparing their exact differences in length or shape. The N-Gram model computes similarity by serial segmentation of the same characters in different sentences; the number of shared substrings is the criterion defining the similarity of two sentences. The N-Gram model is suitable when the sentences concerned depend on a small corpus or a small vocabulary. The expression is:
Similarity = |G_N(S)| + |G_N(T)| - 2*|G_N(S) ∩ G_N(T)|
where Similarity denotes the similarity, whose value lies in the range [0, 1]; G_N denotes the N-Gram model, a language model used in large-vocabulary continuous speech recognition that exploits collocation information between adjacent words in the context (for example, to convert input automatically into Chinese characters); and S and T denote the two pieces of text to be matched.
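For illustration only, the following minimal Python sketch (function and variable names are hypothetical and not part of the disclosure) builds the character N-Gram sets of two texts and evaluates the expression above; as written, the expression counts the N-Grams the two texts do not share, so a smaller value indicates higher similarity, and an additional normalization step would be needed to map it into [0, 1].

def char_ngrams(text, n=2):
    # G_N(.): the set of character N-Grams of a text
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_dissimilarity(s, t, n=2):
    # |G_N(S)| + |G_N(T)| - 2*|G_N(S) ∩ G_N(T)|
    gs, gt = char_ngrams(s, n), char_ngrams(t, n)
    return len(gs) + len(gt) - 2 * len(gs & gt)

# Example: two phrases from theft judgments sharing most bigrams give a small value.
print(ngram_dissimilarity("盗窃罪判决书", "盗窃案判决书"))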
In step S102 of this embodiment, a corpus filled with the words found in legal documents is created, ensuring fixed positions based on this large corpus. The frequency of occurrence of the vocabulary is counted by term frequency-inverse document frequency (TF-IDF). The idea is that the importance of a word is positively correlated with its frequency in the sentence and negatively correlated with its frequency in the corpus.
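As a non-limiting sketch, assuming the documents have already been word-segmented (for example with jieba) and joined by spaces, the TF-IDF weighting described above can be computed with scikit-learn's TfidfVectorizer; the sample documents below are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["被告人 犯 盗窃罪 判处 有期徒刑", "被告人 犯 诈骗罪 判处 罚金"]  # hypothetical pre-segmented documents
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")  # treat whitespace-separated tokens as words
tfidf = vectorizer.fit_transform(docs)
# Weight of each word in the first document: frequent in the document but rare in the corpus -> large weight.
for word, col in vectorizer.vocabulary_.items():
    print(word, round(tfidf[0, col], 3))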
In the implementation of step S103, the corpus is counted and screened for the vector-space-based similarity analysis. On the one hand, a conventional corpus provided by a regulatory agency can be selected. Such corpora typically have a larger vocabulary and a wider range of keywords, and usually contain more words than belong to the training data. In this case, words not shared with the training data may be deleted to improve accuracy. We represent all the training data with K and the corpus with β; the calculation formula of K is:
K = K ∩ β
on the other hand, if our goal is to strengthen the pertinence to the same type of case, such as to theft cases, in this embodiment, a target corpus should be created. The process of creating a corpus is mainly related to the frequency of computing the different vocabularies. After forward ordering the word frequencies with the corresponding words, irrelevant but common words, such as auxiliary words, may be excluded. Once the ten thousand words with the highest frequency of occurrence are obtained, the basic portion of the corpus is completed. We denote all words by α and all irrelevant words by γ. Wherein, the calculation formula of alpha is as follows:
in step S104, sentences may be projected into a shorter vector by the distributed representation, and the same character string in different sentences is segmented; the ratio of intersections and unions between words is calculated by a thermal vector discretization process.
In step S105, the conventional corpus provided by a regulatory agency is selected to train the word2vec model.
In step S106, to compensate for the shortcomings on huge data, the way the word2vec model is used is changed: the union of the previous vector space and the micro corpus is obtained by enlarging the space vector. The word2vec model is thus based on a new, larger vector space with a finer representation in the dimensions revised by the small corpus. M denotes the vector space and δ denotes the micro corpus; the calculation formula of M is:
M = train(M ∪ δ)
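One possible realization of steps S105-S106 with the gensim library is sketched below; the sentences and hyperparameter values are placeholders, and incremental vocabulary update is only one way to realize the union M ∪ δ, not necessarily the exact procedure of the invention.

from gensim.models import Word2Vec

conventional_corpus = [["被告人", "盗窃", "判决"], ["合同", "违约", "赔偿"]]   # tokenised sentences from the conventional corpus
micro_corpus = [["盗窃", "数额", "较大"]]                                      # δ: small targeted corpus

model = Word2Vec(sentences=conventional_corpus, vector_size=100, window=5, min_count=1)
model.build_vocab(micro_corpus, update=True)   # extend the vector space with words from δ
model.train(micro_corpus, total_examples=len(micro_corpus), epochs=model.epochs)  # M = train(M ∪ δ)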
In the implementation of step S107, the evaluation of the word2vec model based on the legal vocabulary corpus depends on the selected theft cases. We introduce a basic index of efficiency and accuracy, the similarity ratio, calculated by comparing each single-dimensional vector in the different vector expressions. Since the sentences are selected from the same case type and are formally identified as extremely similar, a higher cosine similarity indicates a more accurate word2vec model; the quality of the models is therefore judged by cosine similarity.
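A minimal evaluation sketch, reusing the model from the previous example and assuming each sentence is represented by the mean of its word vectors (one common choice; the description does not fix the aggregation method):

import numpy as np

def sentence_vector(model, words):
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two sentences chosen from the same (theft) type and known to be extremely similar:
v1 = sentence_vector(model, ["被告人", "盗窃", "判决"])
v2 = sentence_vector(model, ["盗窃", "数额", "较大"])
print(cosine_similarity(v1, v2))  # the higher the value, the more accurate the word2vec model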
In step S108, the data of S104-S107 are processed with a control-variable method, and the word2vec model parameters are optimized. In step S109, the text is projected into a shorter vector through word2vec so as to map the text to a vector, completing the text vectorization. In step S110, similarity matching is performed on the segmented character strings, and the similarity of the text is calculated over the set of character-string similarities constituting the text.
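A sketch of the control-variable tuning of step S108: one hyperparameter is varied at a time while the others are held fixed, and each setting is scored by the cosine similarity of known-similar sentence pairs. It reuses the helper functions and corpus from the previous sketches; the evaluation pairs and grid values are hypothetical.

from gensim.models import Word2Vec

similar_pairs = [(["被告人", "盗窃", "判决"], ["盗窃", "数额", "较大"])]  # hypothetical known-similar pairs

def average_similarity(m):
    return sum(cosine_similarity(sentence_vector(m, a), sentence_vector(m, b))
               for a, b in similar_pairs) / len(similar_pairs)

best_score, best_window = -1.0, None
for window in (3, 5, 8):  # vary only the context window; other parameters stay fixed
    candidate = Word2Vec(sentences=conventional_corpus, vector_size=100, window=window, min_count=1)
    score = average_similarity(candidate)
    if score > best_score:
        best_score, best_window = score, window
print("best window size:", best_window)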
Fig. 3 compares the effects of the word2vec model and the BoW model in the legal scenario of this embodiment. It can be seen that when matching character-string similarity the word2vec model of this embodiment is consistently superior to the BoW model, and that it maintains this better performance as the legal text data set grows.
Fig. 4 compares a conventional word2vec model with the word2vec model of this embodiment. It can be seen that the word2vec model of this embodiment still performs well on smaller legal data sets, thanks to the legal vocabulary corpus and word-frequency statistics constructed in advance in this embodiment; the problem of insufficient model accuracy is solved by enlarging the space vector, so the word2vec model of this embodiment outperforms the conventional word2vec model.
It should be noted that, although the foregoing embodiments have been described herein, the scope of the present invention is not limited thereby. Therefore, based on the innovative concept of the present invention, alterations and modifications of the embodiments described herein, as well as equivalent structures or equivalent process transformations made using the contents of the description and drawings, which apply the above technical solution directly or indirectly to other relevant technical fields, are all included within the scope of protection of the present invention.

Claims (7)

1. A legal document similarity analysis method based on Word2vec is characterized by comprising the following steps:
S101, defining two criteria of text similarity to form a set of similarities between text strings;
S102, creating a legal vocabulary corpus, and calculating the occurrence frequency of its vocabulary through term frequency and inverse document frequency;
S103, constructing a theme according to the characteristics of the application scenario, and counting and screening the corpus and the occurrence frequency of its vocabulary;
S104, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between vocabularies;
S105, constructing a training set from a legal text data set and training a word2vec model;
S106, enlarging the vector space by acquiring the union of the original vector space and a micro corpus;
S107, introducing basic indexes measuring efficiency and accuracy, and evaluating the word2vec model through cosine similarity;
S108, processing the data of S104-S107 with a control-variable method, and optimizing the word2vec model parameters;
S109, projecting the text into a shorter vector through word2vec, so as to map the text to a vector;
S110, performing similarity matching on the segmented character strings, and calculating the similarity of the text over the set of character-string similarities constituting the text.
2. The Word2vec-based legal document similarity analysis method according to claim 1, wherein: in step S101, the similarity measures the difference between similar texts by comparing their exact differences in length or shape.
3. The Word2vec-based legal document similarity analysis method according to claim 1, wherein: in step S102, a legal vocabulary corpus is created and the frequencies of the different vocabularies are calculated; after the word frequencies and the corresponding words are sorted, irrelevant but common words are removed; with α representing all words and γ representing all irrelevant words, the calculation expression of α is:
α = α - γ
4. The Word2vec-based legal document similarity analysis method according to claim 1, wherein: in step S103, the corpus includes a conventional corpus and a target corpus;
the conventional corpus contains more words than belong to the training data, in which case words not shared with the training data are deleted; with K representing all the training data and β representing the corpus, the calculation expression of K is:
K = K ∩ β
the target corpus is used to strengthen the pertinence to cases of the same type.
5. The Word2vec-based legal document similarity analysis method according to claim 4, wherein step S104 comprises: projecting the text into a short vector through a distributed representation, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between words through a one-hot discretization process.
6. The Word2vec-based legal document similarity analysis method according to claim 5, wherein: in step S105, a conventional corpus is selected to train the word2vec model.
7. The Word2vec-based legal document similarity analysis method according to claim 6, wherein: in step S106, the vector space is enlarged by acquiring the union of the original vector space and the micro corpus; M represents the vector space, δ represents the micro corpus, and the calculation expression of M is:
M = train(M ∪ δ)
CN202310236373.8A 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec Pending CN116561594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310236373.8A CN116561594A (en) 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310236373.8A CN116561594A (en) 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec

Publications (1)

Publication Number Publication Date
CN116561594A true CN116561594A (en) 2023-08-08

Family

ID=87490505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310236373.8A Pending CN116561594A (en) 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec

Country Status (1)

Country Link
CN (1) CN116561594A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117350288A (en) * 2023-12-01 2024-01-05 浙商银行股份有限公司 Case matching-based network security operation auxiliary decision-making method, system and device
CN117350288B (en) * 2023-12-01 2024-05-03 浙商银行股份有限公司 Case matching-based network security operation auxiliary decision-making method, system and device

Similar Documents

Publication Publication Date Title
Jung Semantic vector learning for natural language understanding
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
Wang et al. Multilayer dense attention model for image caption
CN113792818A (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN101079025B (en) File correlation computing system and method
CN110674252A (en) High-precision semantic search system for judicial domain
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN112818661B (en) Patent technology keyword unsupervised extraction method
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN112417854A (en) Chinese document abstraction type abstract method
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114997288A (en) Design resource association method
CN115269882A (en) Intellectual property retrieval system and method based on semantic understanding
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114647715A (en) Entity recognition method based on pre-training language model
Li et al. Dimsim: An accurate chinese phonetic similarity algorithm based on learned high dimensional encoding
CN116561594A (en) Legal document similarity analysis method based on Word2vec
CN116680363A (en) Emotion analysis method based on multi-mode comment data
Li et al. STD: An automatic evaluation metric for machine translation based on word embeddings
CN112287119B (en) Knowledge graph generation method for extracting relevant information of online resources
CN114064901A (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN113836941B (en) Contract navigation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination