CN116561594A - Legal document similarity analysis method based on Word2vec - Google Patents


Info

Publication number
CN116561594A
Authority
CN
China
Prior art keywords
corpus
similarity
legal
text
word2vec
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310236373.8A
Other languages
Chinese (zh)
Inventor
郑志松
刘晓雷
吴运昌
丁仙峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Shudui Technology Co ltd
Original Assignee
Jiangsu Shudui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Shudui Technology Co ltd filed Critical Jiangsu Shudui Technology Co ltd
Priority to CN202310236373.8A priority Critical patent/CN116561594A/en
Publication of CN116561594A publication Critical patent/CN116561594A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/903 - Querying
    • G06F16/90335 - Query processing
    • G06F16/90344 - Query processing by using string matching techniques
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Word2vec-based legal document similarity analysis method comprising the following steps: defining two criteria of text similarity; creating a legal vocabulary corpus; counting and screening the corpus and the occurrence frequency of its vocabulary; segmenting identical character strings in different texts and calculating the ratio of intersection to union between vocabularies; constructing a training set from a legal text data set and training a word2vec model; enlarging the vector space by taking the union of the original vector space and a micro corpus; evaluating the word2vec model through cosine similarity; and optimizing the word2vec model with a control-variable method. The method combines Word2vec with a legal document corpus, uses this specific corpus to describe the shape of the decision boundary, completes the similarity analysis of legal documents, compensates for the deficiencies of the decision boundary, and improves the accuracy and sensitivity of the model.

Description

Legal document similarity analysis method based on Word2vec
Technical Field
The invention relates to the field of legal document similarity analysis, and in particular to a Word2vec-based legal document similarity analysis method.
Background
Advanced technologies such as big data, data analysis, the Internet of Things, wireless technology and 3D printing have permeated our daily lives, bringing great progress to economic development and convenience to everyday life. Intelligent justice, as an important component of the intelligent society, has achieved remarkable results in intelligent prisons, intelligent notarization, intelligent adjudication and other areas. As the demand for computer-assisted justice grows, so does the demand for intelligent justice, and deep learning has gradually become an effective means of supporting it. In recent years, the success of text similarity has drawn researchers' attention to the application of similarity analysis to legal documents.
Similarity analysis of legal documents is the foundation of intelligent justice. Legal documents for different categories of cases vary greatly in format and length. In the field of text similarity, corpus-based methods overcome basic problems of natural language processing and reach accuracy competitive with human recognition. However, they are easily misled into erroneous predictions, which cannot be tolerated in intelligent judicial applications with strict requirements on judicial fairness. Contextual relations and the meaning inside the text therefore need to be considered.
Traditional text similarity matching methods such as TF-IDF, BM25, Jaccard, SimHash and LDA depend heavily on manual feature engineering; their generalization ability is mediocre, the number of features is limited, and the resulting models are often weak. Such techniques may rely too heavily on the quality of the text representation when calculating similarity, and may also lose underlying textual features such as lexical and syntactic information. Their calculation is a black-box process: the input is a pair of sentences and the output is their similarity value; no sentence vector representation is produced in between, and a single sentence cannot be input, so the method is difficult to adapt to a given application scenario. Such methods are therefore impractical for tasks that require a text vector representation. In addition, traditional text similarity is often limited when processing documents of the same length and format but different content, because it ignores the embedded meaning behind the words. Similarity analysis of legal documents is the foundation of intelligent justice, and legal documents from different cases differ greatly in format and field, which makes similarity analysis difficult. Traditional vector-space similarity analysis usually relies on discrete representations, which fundamentally cannot express embedded meaning nor fix the vector basis. Moreover, as with one-hot representations, a discrete method yields a sparse matrix whose dimensionality becomes very high, creating unacceptable computational cost on massive data. Finally, existing studies have shown that small perturbations of text format and length can easily mislead the predictions of such models. Although term frequency and inverse document frequency are expected to compensate for errors in format and length, they achieve only a relative improvement; in judicial application scenarios with extremely high accuracy requirements, this result remains intolerable.
Disclosure of Invention
Therefore, it is necessary to provide a method that combines Word2vec with a legal document corpus, describes the shape of the decision boundary through the specific legal document corpus, completes the similarity analysis of legal documents, and improves the accuracy and sensitivity of the model by compensating for the deficiencies of the decision boundary.
In order to achieve the above object, the present inventors provide a legal document similarity analysis method based on Word2vec, comprising the following steps:
S101, defining two criteria of text similarity to form a set of similarities between text strings;
S102, creating a legal vocabulary corpus, and calculating the occurrence frequency of its vocabulary through term frequency and inverse document frequency;
S103, constructing a theme according to the characteristics of the application scenario, and counting and screening the corpus and the occurrence frequency of its vocabulary;
S104, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between vocabularies;
S105, constructing a training set from a legal text data set and training a word2vec model;
S106, enlarging the vector space by acquiring the union of the original vector space and a micro corpus;
S107, introducing basic indexes measuring efficiency and accuracy, and evaluating the word2vec model through cosine similarity;
S108, processing the data of S104-S107 with a control-variable method, and optimizing the word2vec model parameters;
S109, projecting the text into a shorter vector through word2vec, so as to map the text to a vector;
S110, performing similarity matching on the segmented character strings, and calculating the similarity of the text over the set of character-string similarities constituting the text.
As a preferred mode of the present invention, in step S101, the similarity measures the difference between similar texts by comparing their exact differences in length or shape.
As a preferred mode of the present invention, in step S102, a legal vocabulary corpus is created and the frequencies of the different vocabularies are calculated; after the word frequencies and the corresponding words are sorted, irrelevant but common words are removed. With α representing all words and γ representing all irrelevant words, the calculation expression of α is:
α = α - γ
As a preferred mode of the present invention, in step S103, the corpus includes a conventional corpus and a target corpus;
the conventional corpus contains more words than belong to the training data, in which case words not shared with the training data are deleted; with K representing all the training data and β representing the corpus, the calculation expression of K is:
K = K ∩ β
the target corpus is used to strengthen the pertinence to cases of the same type.
As a preferred mode of the present invention, step S104 includes: projecting the text into a short vector through a distributed representation, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between words through a one-hot discretization process.
In a preferred mode of the present invention, in step S105, a conventional corpus is selected to train the word2vec model.
In step S106, the vector space is enlarged by obtaining the union of the original vector space and the micro corpus, M represents the vector space, δ represents the micro corpus, and the calculation expression of M is:
M = train(M ∪ δ)
Compared with the prior art, the beneficial effects achieved by the technical solution are as follows:
The method combines Word2vec with a legal vocabulary corpus and establishes a text similarity criterion according to the characteristics of legal scenarios. A legal vocabulary corpus and legal word-frequency statistics are built from the legal data set, and the shape of the decision boundary is described through this specific legal document corpus to complete the similarity analysis of legal documents. In this process, identical character positions of different texts are segmented, and the ratio of intersection to union of the segmented character strings is recorded, which helps optimize word2vec so that it better fits legal scenarios. During model training, enlarging the space vector effectively alleviates the insufficient accuracy of the word2vec model on large-scale data sets; indexes measuring efficiency and accuracy are introduced, and the effect of the model is evaluated through cosine similarity to complete the optimization of the word2vec model. By compensating for the deficiencies of the decision boundary, the accuracy and sensitivity of the word2vec model are effectively improved.
The word2vec model based on the legal vocabulary corpus is shown to be a progressive and careful approach to the problem of legal text similarity; it also points the way toward more accurate techniques and supports the prediction of sentence increases or decreases and of sentence length to assist the adjudication process.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment;
FIG. 2 is a block diagram of a method according to an embodiment;
FIG. 3 is a comparison of word2vec with the bag-of-words (BoW) model according to the embodiment;
FIG. 4 is a comparison of word2vec models on different legal document data sets according to the embodiment.
Detailed Description
In order to describe in detail the technical content, structural features, objects and effects of the technical solution, the following description is given with reference to specific embodiments and the accompanying drawings.
As shown in fig. 1 and 2, the present embodiment provides a legal document similarity analysis method based on Word2vec, which includes the following steps:
S101, defining two criteria of text similarity to form a set of similarities between text strings;
S102, creating a legal vocabulary corpus, and calculating the occurrence frequency of its vocabulary through term frequency and inverse document frequency;
S103, constructing a theme according to the characteristics of the application scenario, and counting and screening the corpus and the occurrence frequency of its vocabulary;
S104, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between vocabularies;
S105, constructing a training set from a legal text data set and training a word2vec model;
S106, enlarging the vector space by acquiring the union of the original vector space and a micro corpus;
S107, introducing basic indexes measuring efficiency and accuracy, and evaluating the word2vec model through cosine similarity;
S108, processing the data of S104-S107 with a control-variable method, and optimizing the word2vec model parameters;
S109, projecting the text into a shorter vector through word2vec, so as to map the text to a vector;
S110, performing similarity matching on the segmented character strings, and calculating the similarity of the text over the set of character-string similarities constituting the text.
In the implementation process of the above embodiment:
In step S101, the similarity criterion is clarified: the difference between similar sentences is measured by comparing their exact differences in length or shape. The N-Gram model computes similarity by serial segmentation of the same characters in different sentences; the number of shared substrings is the criterion defining the similarity of two sentences. The N-Gram model is suitable when the sentences concerned depend on a small corpus or a small vocabulary. The expression is:
Similarity = |G_N(S)| + |G_N(T)| - 2*|G_N(S) ∩ G_N(T)|
where Similarity denotes the similarity, whose value lies in the range [0, 1]; G_N denotes the N-Gram model, a language model used in large-vocabulary continuous speech recognition that exploits collocation information between adjacent words in the context (for example, to convert input automatically into Chinese characters); and S and T denote the two pieces of text to be matched.
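For illustration only, the following minimal Python sketch (function and variable names are hypothetical and not part of the disclosure) builds the character N-Gram sets of two texts and evaluates the expression above; as written, the expression counts the N-Grams the two texts do not share, so a smaller value indicates higher similarity, and an additional normalization step would be needed to map it into [0, 1].

def char_ngrams(text, n=2):
    # G_N(.): the set of character N-Grams of a text
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_dissimilarity(s, t, n=2):
    # |G_N(S)| + |G_N(T)| - 2*|G_N(S) ∩ G_N(T)|
    gs, gt = char_ngrams(s, n), char_ngrams(t, n)
    return len(gs) + len(gt) - 2 * len(gs & gt)

# Example: two phrases from theft judgments sharing most bigrams give a small value.
print(ngram_dissimilarity("盗窃罪判决书", "盗窃案判决书"))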
In step S102 of this embodiment, a corpus filled with the words found in legal documents is created, ensuring fixed positions based on this large corpus. The frequency of occurrence of the vocabulary is counted by term frequency-inverse document frequency (TF-IDF). The idea is that the importance of a word is positively correlated with its frequency in the sentence and negatively correlated with its frequency in the corpus.
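As a non-limiting sketch, assuming the documents have already been word-segmented (for example with jieba) and joined by spaces, the TF-IDF weighting described above can be computed with scikit-learn's TfidfVectorizer; the sample documents below are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["被告人 犯 盗窃罪 判处 有期徒刑", "被告人 犯 诈骗罪 判处 罚金"]  # hypothetical pre-segmented documents
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")  # treat whitespace-separated tokens as words
tfidf = vectorizer.fit_transform(docs)
# Weight of each word in the first document: frequent in the document but rare in the corpus -> large weight.
for word, col in vectorizer.vocabulary_.items():
    print(word, round(tfidf[0, col], 3))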
In the implementation of step S103, the corpus is counted and screened for the vector-space-based similarity analysis. On the one hand, a conventional corpus provided by a regulatory agency can be selected. Such corpora typically have a larger vocabulary and a wider range of keywords, and usually contain more words than belong to the training data. In this case, words not shared with the training data may be deleted to improve accuracy. We represent all the training data with K and the corpus with β; the calculation formula of K is:
K = K ∩ β
on the other hand, if our goal is to strengthen the pertinence to the same type of case, such as to theft cases, in this embodiment, a target corpus should be created. The process of creating a corpus is mainly related to the frequency of computing the different vocabularies. After forward ordering the word frequencies with the corresponding words, irrelevant but common words, such as auxiliary words, may be excluded. Once the ten thousand words with the highest frequency of occurrence are obtained, the basic portion of the corpus is completed. We denote all words by α and all irrelevant words by γ. Wherein, the calculation formula of alpha is as follows:
in step S104, sentences may be projected into a shorter vector by the distributed representation, and the same character string in different sentences is segmented; the ratio of intersections and unions between words is calculated by a thermal vector discretization process.
In step S105, the conventional corpus provided by a regulatory agency is selected to train the word2vec model.
In step S106, to compensate for the shortcomings on huge data, the way the word2vec model is used is changed: the union of the previous vector space and the micro corpus is obtained by enlarging the space vector. The word2vec model is thus based on a new, larger vector space with a finer representation in the dimensions revised by the small corpus. M denotes the vector space and δ denotes the micro corpus; the calculation formula of M is:
M = train(M ∪ δ)
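One possible realization of steps S105-S106 with the gensim library is sketched below; the sentences and hyperparameter values are placeholders, and incremental vocabulary update is only one way to realize the union M ∪ δ, not necessarily the exact procedure of the invention.

from gensim.models import Word2Vec

conventional_corpus = [["被告人", "盗窃", "判决"], ["合同", "违约", "赔偿"]]   # tokenised sentences from the conventional corpus
micro_corpus = [["盗窃", "数额", "较大"]]                                      # δ: small targeted corpus

model = Word2Vec(sentences=conventional_corpus, vector_size=100, window=5, min_count=1)
model.build_vocab(micro_corpus, update=True)   # extend the vector space with words from δ
model.train(micro_corpus, total_examples=len(micro_corpus), epochs=model.epochs)  # M = train(M ∪ δ)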
In the implementation of step S107, the evaluation of the word2vec model based on the legal vocabulary corpus depends on the selected theft cases. We introduce a basic index of efficiency and accuracy, the similarity ratio, calculated by comparing each single-dimensional vector in the different vector expressions. Since the sentences are selected from the same case type and are formally identified as extremely similar, a higher cosine similarity indicates a more accurate word2vec model; the quality of the models is therefore judged by cosine similarity.
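A minimal evaluation sketch, reusing the model from the previous example and assuming each sentence is represented by the mean of its word vectors (one common choice; the description does not fix the aggregation method):

import numpy as np

def sentence_vector(model, words):
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two sentences chosen from the same (theft) type and known to be extremely similar:
v1 = sentence_vector(model, ["被告人", "盗窃", "判决"])
v2 = sentence_vector(model, ["盗窃", "数额", "较大"])
print(cosine_similarity(v1, v2))  # the higher the value, the more accurate the word2vec model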
In step S108, the data of S104-S107 are processed with a control-variable method, and the word2vec model parameters are optimized. In step S109, the text is projected into a shorter vector through word2vec so as to map the text to a vector, completing the text vectorization. In step S110, similarity matching is performed on the segmented character strings, and the similarity of the text is calculated over the set of character-string similarities constituting the text.
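A sketch of the control-variable tuning of step S108: one hyperparameter is varied at a time while the others are held fixed, and each setting is scored by the cosine similarity of known-similar sentence pairs. It reuses the helper functions and corpus from the previous sketches; the evaluation pairs and grid values are hypothetical.

from gensim.models import Word2Vec

similar_pairs = [(["被告人", "盗窃", "判决"], ["盗窃", "数额", "较大"])]  # hypothetical known-similar pairs

def average_similarity(m):
    return sum(cosine_similarity(sentence_vector(m, a), sentence_vector(m, b))
               for a, b in similar_pairs) / len(similar_pairs)

best_score, best_window = -1.0, None
for window in (3, 5, 8):  # vary only the context window; other parameters stay fixed
    candidate = Word2Vec(sentences=conventional_corpus, vector_size=100, window=window, min_count=1)
    score = average_similarity(candidate)
    if score > best_score:
        best_score, best_window = score, window
print("best window size:", best_window)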
Fig. 3 compares the effects of the word2vec model and the BoW model in the legal scenario of this embodiment. It can be seen that when matching character-string similarity the word2vec model of this embodiment is consistently superior to the BoW model, and that it maintains this better performance as the legal text data set grows.
Fig. 4 compares a conventional word2vec model with the word2vec model of this embodiment. It can be seen that the word2vec model of this embodiment still performs well on smaller legal data sets, thanks to the legal vocabulary corpus and word-frequency statistics constructed in advance in this embodiment; the problem of insufficient model accuracy is solved by enlarging the space vector, so the word2vec model of this embodiment outperforms the conventional word2vec model.
It should be noted that, although the foregoing embodiments have been described herein, the scope of the present invention is not limited thereby. Therefore, based on the innovative concept of the present invention, alterations and modifications of the embodiments described herein, as well as equivalent structures or equivalent process transformations made using the contents of the description and drawings, which apply the above technical solution directly or indirectly to other relevant technical fields, are all included within the scope of protection of the present invention.

Claims (7)

1. A legal document similarity analysis method based on Word2vec is characterized by comprising the following steps:
S101, defining two criteria of text similarity to form a set of similarities between text strings;
S102, creating a legal vocabulary corpus, and calculating the occurrence frequency of its vocabulary through term frequency and inverse document frequency;
S103, constructing a theme according to the characteristics of the application scenario, and counting and screening the corpus and the occurrence frequency of its vocabulary;
S104, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between vocabularies;
S105, constructing a training set from a legal text data set and training a word2vec model;
S106, enlarging the vector space by acquiring the union of the original vector space and a micro corpus;
S107, introducing basic indexes measuring efficiency and accuracy, and evaluating the word2vec model through cosine similarity;
S108, processing the data of S104-S107 with a control-variable method, and optimizing the word2vec model parameters;
S109, projecting the text into a shorter vector through word2vec, so as to map the text to a vector;
S110, performing similarity matching on the segmented character strings, and calculating the similarity of the text over the set of character-string similarities constituting the text.
2. The Word2vec-based legal document similarity analysis method according to claim 1, wherein: in step S101, the similarity measures the difference between similar texts by comparing their exact differences in length or shape.
3. The Word2vec-based legal document similarity analysis method according to claim 1, wherein: in step S102, a legal vocabulary corpus is created and the frequencies of the different vocabularies are calculated; after the word frequencies and the corresponding words are sorted, irrelevant but common words are removed; with α representing all words and γ representing all irrelevant words, the calculation expression of α is:
α = α - γ
4. The Word2vec-based legal document similarity analysis method according to claim 1, wherein: in step S103, the corpus includes a conventional corpus and a target corpus;
the conventional corpus contains more words than belong to the training data, in which case words not shared with the training data are deleted; with K representing all the training data and β representing the corpus, the calculation expression of K is:
K = K ∩ β
the target corpus is used to strengthen the pertinence to cases of the same type.
5. The Word2vec-based legal document similarity analysis method according to claim 4, wherein step S104 comprises: projecting the text into a short vector through a distributed representation, segmenting identical character strings in different texts, and calculating the ratio of intersection to union between words through a one-hot discretization process.
6. The Word2vec-based legal document similarity analysis method according to claim 5, wherein: in step S105, a conventional corpus is selected to train the word2vec model.
7. The Word2vec-based legal document similarity analysis method according to claim 6, wherein: in step S106, the vector space is enlarged by acquiring the union of the original vector space and the micro corpus; M represents the vector space, δ represents the micro corpus, and the calculation expression of M is:
M = train(M ∪ δ)
CN202310236373.8A 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec Pending CN116561594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310236373.8A CN116561594A (en) 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310236373.8A CN116561594A (en) 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec

Publications (1)

Publication Number Publication Date
CN116561594A true CN116561594A (en) 2023-08-08

Family

ID=87490505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310236373.8A Pending CN116561594A (en) 2023-03-13 2023-03-13 Legal document similarity analysis method based on Word2vec

Country Status (1)

Country Link
CN (1) CN116561594A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117350288A (en) * 2023-12-01 2024-01-05 浙商银行股份有限公司 Case matching-based network security operation auxiliary decision-making method, system and device
CN117350288B (en) * 2023-12-01 2024-05-03 浙商银行股份有限公司 Case matching-based network security operation auxiliary decision-making method, system and device

Similar Documents

Publication Publication Date Title
Jung Semantic vector learning for natural language understanding
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
Wang et al. Multilayer dense attention model for image caption
CN113792818A (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN101079025B (en) File correlation computing system and method
CN110674252A (en) High-precision semantic search system for judicial domain
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN112818661B (en) Patent technology keyword unsupervised extraction method
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN112417854A (en) Chinese document abstraction type abstract method
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114997288A (en) Design resource association method
CN115269882A (en) Intellectual property retrieval system and method based on semantic understanding
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114647715A (en) Entity recognition method based on pre-training language model
Li et al. Dimsim: An accurate chinese phonetic similarity algorithm based on learned high dimensional encoding
CN116561594A (en) Legal document similarity analysis method based on Word2vec
CN116680363A (en) Emotion analysis method based on multi-mode comment data
Li et al. STD: An automatic evaluation metric for machine translation based on word embeddings
CN112287119B (en) Knowledge graph generation method for extracting relevant information of online resources
CN114064901A (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN113836941B (en) Contract navigation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination