CN109543178B

CN109543178B - Method and system for constructing judicial text label system

Info

Publication number: CN109543178B
Application number: CN201811294777.8A
Authority: CN
Inventors: 丁锴; 李建元; 陈涛; 王开红
Original assignee: Yinjiang Technology Co ltd
Current assignee: Yinjiang Technology Co ltd
Priority date: 2018-11-01
Filing date: 2018-11-01
Publication date: 2023-02-28
Anticipated expiration: 2038-11-01
Also published as: CN109543178A

Abstract

The application provides a method and a system for constructing a judicial text label system. The method comprises: obtaining a judicial vocabulary text through a word segmentation tool; constructing a primary label system according to word frequency statistics; merging labels with similar semantics in the primary label system and expanding labels to cover non-standard expressions, thereby obtaining an expanded label system; using a text test set to measure the accuracy with which the expanded label system searches text; and verifying whether construction of the current expanded label system is complete, and otherwise further optimizing the label system. The method enables a targeted label system to be constructed for different laws and greatly improves the search precision of judicial texts.

Description

Method and system for constructing judicial text label system

Technical Field

The application relates to the field of natural language processing, in particular to a method and a system for constructing a judicial text label system.

Background

As the legal field becomes more open and transparent, more and more official documents are placed under public supervision. According to statistics from China Judgments Online, over 5 thousand documents are currently online, and the total grows by about 30,000 every day. However, the growth of legal text resources also brings a series of problems: storage requirements keep increasing, searches become slower, and search results often do not contain the desired information. These problems reduce the efficiency with which legal text resources can be used. To solve them, legal texts must be processed. A common method for processing massive Internet data is data tagging, namely the Vector Space Model: the data is processed into a series of keywords (terms) or tags, which are then used to generate an index code. Legal text processing uses the same model; the difference lies in how the tags are defined.

There has been a great deal of work on text label extraction. Patent CN201510697001 proposes mining notification-type short messages by writing regular expressions against existing short message texts and using the mined XX as the identity label information of a short message text; for the mined notification-type short message texts, the identity label information with the highest frequency is taken, by means of a threshold, as the final identity label information of the service number. The identity label can also be updated in real time when a new short message arrives.

Patent CN201710541481 proposes a text label generation method that extracts keywords from a target text using a strategy specific to each label type, obtaining candidate labels of each label type; the candidate labels are then cross-validated across label types, and the target labels of the target text are determined from the validated candidates. Because labels are extracted separately for different label types (entity words, segment texts and/or topics) and then cross-validated, extraction accuracy is improved, which addresses the low label extraction accuracy of the prior art.

Patent CN201711213971 proposes a method for generating text label words. Label words are first extracted from a text, and associated grouped label words are generated from the extracted label words and a preset label word relation; the grouped label words are aggregated according to the association relation among them, and the aggregated grouped label words that are completely covered by the text are looked up in a preset label word dictionary to obtain combined label words; finally, mapped label words are generated for the text from the combined label words and the preset label word relation. Corresponding label words can thus be generated for a text quickly and independently according to actual requirements, without the intervention of professionals.

Patent CN201510197328 proposes a text label extraction method that first predicts the text category, then predicts a topic through a topic clustering model, then extracts text keywords, and finally takes the target category, target topic, and target keywords as the labels of the text. Because the labels have different levels, retrieval requirements of different granularities are met, and articles can be recommended at different granularities according to different labels.

Because legal texts contain many specialized terms and the points of dispute in different cases overlap heavily, the text label extraction methods above cannot meet the accuracy requirement. Therefore, a new label system is provided: a label dictionary is established through a series of regularization rules, and the label dictionary is verified and optimized through the correspondence between legal cases and legal provisions, so that the search precision of legal texts is improved.

Disclosure of Invention

The invention provides a method and a system for constructing a judicial text label system, aimed at the problems that legal texts contain many specialized terms and that the points of dispute in different cases overlap heavily. By combining the advantages of machine learning and manual review, the accuracy of legal text retrieval can be significantly improved while manual intervention is reduced.

A method for constructing a judicial text label system is characterized by comprising the following steps:

acquiring a vocabulary text, wherein the vocabulary text refers to a form of representing a text by a vocabulary;

selecting candidate labels according to the word frequency and/or the combined word frequency of the vocabulary text to obtain a primary label system;

merging and/or expanding labels according to the similarity of the labels in the primary label system to obtain an expanded label system;

and determining that the final label system is constructed according to the accuracy of the text searched by the expanded label system.

Further, obtaining vocabulary text, comprising: constructing a judicial vocabulary, adding the judicial vocabulary into a custom dictionary of a word segmentation tool, and segmenting a judicial text to obtain a vocabulary text;

wherein, the constructing the judicial vocabulary comprises:

adding entries of a legal dictionary, a legal professional lexicon and the like into a prepared vocabulary;

counting the combined word frequency of the conventional words, and adding the conventional word combination with the combined word frequency meeting a set threshold value I into a prepared vocabulary table as a new word;

manually reviewing the segmentation result, namely adding professional terms that were not segmented correctly into the prepared vocabulary;

a judicial vocabulary is obtained.

Further, according to the word frequency and the combined word frequency of the vocabulary text, selecting a candidate tag to obtain a primary tag system, comprising:

defining the window length K, counting the occurrence times of any M vocabulary combinations by using a window traversal method, taking the vocabulary in the N combinations with the highest occurrence times as a keyword, counting the word frequency of a single vocabulary in the keyword, taking the vocabulary with the word frequency meeting a set threshold II as a candidate tag, and adding the candidate tag into a primary tag system.

Further, the similarity of the labels is calculated by the method comprising the following steps:

setting a character-based label similarity weight p and a semantic-based label similarity weight q;

acquiring label similarity sim (W1, W2) of labels W1 and W2 based on characters, wherein sim (W1, W2) = the number of the same characters in label W1 and label W2/the larger value of the character length of label W1 and label W2;

obtaining tag similarity score (W1, W2) of tags W1 and W2 based on semantics, wherein score (W1, W2) is a correlation value of tag W1 and tag W2, and the correlation value is obtained from a semantic model trained by using a judicial text as a corpus;

the similarity of the tags = p × sim (W1, W2) + q × score (W1, W2) was calculated.

Further,

merging the labels specifically comprises: when the similarity of two labels meets a set threshold III, or ranks among the top R label similarity values in the primary label system, merging the two labels by retaining one of them and removing the other from the primary label system;

expanding the labels specifically comprises: when the similarity between words in the semantic model or the synonym dictionary and a label word meets a set threshold IV, taking those words as expansion words of the label word and adding them to the primary label system.

Further, the accuracy of the search text is calculated by:

establishing a test set, wherein the test set comprises a sample set and a search object set; each sample of the sample set comprises one question, the n cases most relevant to the question, and the m legal provisions most relevant to the question; the search object set comprises all cases and all legal provisions;

extracting text labels of the questions, cases and legal provisions in the sample set to form label vectors;

recommending, from the search object set, cases similar to the question and applicable legal provisions by using a vector matching method, wherein the vector similarity is calculated by using the Euclidean distance;

calculating the accuracy by comparing the recommended cases and legal provisions with the cases and legal provisions of the corresponding sample, wherein the accuracy is represented by the average of the recall rate and the precision rate; recall = number of correct samples retrieved / number of all correct samples in the data set; precision = number of correct samples retrieved / total number of samples retrieved.

Further, the accuracy of the search text is calculated by:

presetting a sample set and a search object set, wherein the sample set $SS$ comprises $NC$ samples; each sample $S_i$ includes a search question $Q_i$ and a vocabulary text set $X_i$ related to the search question; the vocabulary text set $X_i$ comprises $H_i$ vocabulary texts, $X_i = \{x_{i1}, x_{i2}, \dots, x_{iH_i}\}$; the search object set $Y$ includes $NS$ vocabulary texts, $Y = \{y_1, y_2, \dots, y_{NS}\}$;

obtaining the extension labels $Z$ of the search object set $Y$ by using the expanded label system, $Z = \{z_1, z_2, \dots, z_{NS}\}$;

sequentially extracting a sample $S_i$ from the sample set and obtaining the tag vector $T_i$ of the search question $Q_i$;

computing the similarity between the tag vector $T_i$ and each extension label $z_j$, and taking the vocabulary texts corresponding to the $H_i$ extension labels with the highest similarity to form a comparison group $T$;

calculating the single-search accuracy = (number of vocabulary texts in the comparison group $T$ that also belong to the set $X_i$) / $H_i$;

and traversing the whole sample set, and calculating the average accuracy as the accuracy of the search text.
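The averaging procedure above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout, and the use of Jaccard overlap as the similarity between a question's tags and an extension label are all assumptions, since the text does not fix the similarity measure for this variant.

```python
def jaccard(a, b):
    """Assumed similarity between two tag sets (not specified by the source)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def search_accuracy(samples, search_tags, similarity=jaccard):
    """Average single-search accuracy over the sample set.

    samples: list of (question_tags T_i, relevant_ids X_i) pairs.
    search_tags: dict mapping each search object id y_j to its extension label z_j.
    """
    total = 0.0
    for question_tags, relevant_ids in samples:
        h = len(relevant_ids)  # H_i: number of texts known to be relevant
        ranked = sorted(
            search_tags,
            key=lambda doc_id: similarity(question_tags, search_tags[doc_id]),
            reverse=True,
        )
        comparison_group = set(ranked[:h])  # top H_i most similar texts
        total += len(comparison_group & set(relevant_ids)) / h
    return total / len(samples)


# Toy data: four search objects and two samples (hypothetical labels).
search_tags = {
    "y1": {"divorce", "custody"},
    "y2": {"traffic", "compensation"},
    "y3": {"divorce", "property"},
    "y4": {"contract"},
}
samples = [
    ({"divorce"}, {"y1", "y3"}),
    ({"traffic", "contract"}, {"y2", "y3"}),
]
accuracy = search_accuracy(samples, search_tags)
```

In this toy run the first question retrieves both of its relevant texts and the second retrieves one of two, so the average accuracy is 0.75.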

Further, determining that the final tag system is constructed according to the accuracy of searching the text by the expanded tag system, and the method comprises the following steps:

when the accuracy of the searched text meets the set threshold V, the current expanded label system is the final label system; otherwise, the values of thresholds I, II, III and IV are adjusted and the current expanded label system is updated, until the accuracy of text searched with the updated expanded label system meets the set threshold V, whereupon the final label system is obtained.

Further, determining that the final label system is constructed according to the accuracy of the text searched by the expanded label system comprises: when the accuracy of the searched text meets a set threshold V, the current expanded label system is the final label system; otherwise, the accuracy of the searched text after removing a given label is calculated, and if that accuracy is unchanged or higher than the accuracy obtained before the label was removed, the label is removed from the expanded label system; all labels are traversed in this way to obtain the final label system.
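The label-removal check described above can be sketched as a single greedy pass. The helper names and the toy accuracy function below are hypothetical; in practice the `evaluate` callback would run the search-accuracy test on the test set with the given labels.

```python
def prune_labels(labels, evaluate):
    """Remove each label whose removal leaves accuracy unchanged or higher.

    evaluate: callable taking a list of labels and returning search accuracy.
    Returns the pruned label list and its accuracy.
    """
    current = list(labels)
    best = evaluate(current)
    for label in list(labels):  # traverse every label once
        trial = [l for l in current if l != label]
        score = evaluate(trial)
        if score >= best:  # unchanged or improved: drop the label
            current, best = trial, score
    return current, best


def toy_accuracy(labels):
    """Stand-in for a real test-set evaluation (an assumption for the demo)."""
    present = set(labels)
    useful = len(present & {"divorce", "custody"}) / 2
    penalty = 0.1 if "weather" in present else 0.0  # a noisy label hurts accuracy
    return useful - penalty


pruned, pruned_accuracy = prune_labels(
    ["divorce", "the", "weather", "custody"], toy_accuracy
)
```

Here the neutral label "the" and the harmful label "weather" are removed, while the two labels that contribute to accuracy are kept.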

A judicial text label system construction system comprises a legal vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, an expansion label module, a verification label module and an optimization label module, wherein,

the legal vocabulary module stores a legal vocabulary which comprises professional vocabularies related to judicial;

the data acquisition module is used for acquiring the judicial texts and carrying out pretreatment;

the word segmentation module is used for adding the legal vocabulary into the general word segmentation tool and segmenting the judicial text provided by the data acquisition module to obtain the judicial vocabulary text;

the primary label building module is used for obtaining the judicial vocabulary text provided by the word segmentation module, counting the word frequency and the combined word frequency, and extracting the vocabulary and the combined vocabulary of which the word frequency and the combined word frequency meet a set threshold value II to serve as a primary label system;

the expansion tag module is used for storing an expansion tag dictionary, counting the similarity of tags in the primary tag system, combining the tags meeting a set threshold value III, extracting corresponding expansion words from the expansion tag dictionary, adding the expansion words into the primary tag system and obtaining the expansion tag system;

the verification label module is used for storing a sample set and a search object set, wherein the sample set comprises a plurality of question labels and, for each question, a related judicial vocabulary text set X, and the search object set comprises a judicial vocabulary text set Y; the labels of set Y are obtained by using the expanded label system, the question labels are extracted from the sample set, and the accuracy is computed by comparing the vocabulary texts in set Y retrieved with the question labels against the vocabulary texts in set X;

the optimized tag module judges whether the accuracy provided by the verified tag module meets a set threshold V or not, and if the accuracy meets the set threshold V, the current tag system is a final tag system; if not, adjusting a set threshold II in the primary label building module, a set threshold III in the expanded label module and a set threshold IV.

By adopting at least one technical scheme, the following beneficial effects can be achieved:

and combining legal vocabularies from various sources to construct a judicial vocabulary table, so that the word segmentation precision of the legal text is improved, and a high-precision word segmentation result is the basis of subsequent text processing.

And establishing a primary label system by using an automatic keyword extraction and part-of-speech tagging method.

Based on the layering thought, different label dictionaries are established for different laws, a label system is established, and cross interference among laws can be effectively eliminated.

A plurality of semantic correlation methods are used for expanding a label dictionary and filling a label system, so that semantic ambiguity caused by non-standard expressions such as spoken language and the like is effectively eliminated.

A large number of cases are used as a test set, a label system is optimized based on a subtraction verification method, and meanwhile, the validity of the label system is verified.

Drawings

Fig. 1 is a flowchart according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step are within the scope of the present application.

The first embodiment provides a method for constructing a judicial text label system, which specifically comprises the following steps:

1. collecting and preprocessing judicial text data.

Collecting judicial text data, for example judgment documents comprising fields such as case name, plaintiff and defendant information, legal case category, case details, applicable laws, and specific legal provisions; the laws, regulations and their interpretive provisions corresponding to the applicable laws and specific legal provisions in the judgment documents are also collected.

Preprocessing the judicial text data: remove records whose case details or applicable law fields are empty, remove records whose case-details text is shorter than a set case detail threshold, and remove duplicate records. For each broad legal category, such as marriage and family or traffic safety, enough cases need to be collected to ensure the diversity and comprehensiveness of the data.
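The three preprocessing filters can be sketched as below. The field names `case_details` and `applicable_laws`, the default threshold, and the definition of a duplicate (same details and same applicable laws) are assumptions made for illustration only.

```python
def preprocess(records, min_detail_len=20):
    """Drop records with empty fields, too-short case details, or duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        details = (rec.get("case_details") or "").strip()
        laws = (rec.get("applicable_laws") or "").strip()
        if not details or not laws:
            continue  # empty case details or applicable-law field
        if len(details) < min_detail_len:
            continue  # case details shorter than the set threshold
        key = (details, laws)
        if key in seen:
            continue  # repeated record
        seen.add(key)
        cleaned.append(rec)
    return cleaned


records = [
    {"case_details": "The parties separated two years ago and dispute custody.",
     "applicable_laws": "Marriage Law"},
    {"case_details": "The parties separated two years ago and dispute custody.",
     "applicable_laws": "Marriage Law"},          # duplicate
    {"case_details": "Too short", "applicable_laws": "Marriage Law"},
    {"case_details": "A long enough description of a traffic accident case.",
     "applicable_laws": ""},                      # empty law field
]
cleaned = preprocess(records)
```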

2. Lexical text is obtained, which refers to a form of text characterized by a vocabulary.

The vocabulary text can be a text after word segmentation processing of a judicial official document, or a text after word segmentation processing of a text corresponding to a certain field in the judicial official document, and the vocabulary text acquisition method can adopt one or more of the following methods.

A. The vocabulary text is acquired directly, either from another system or by direct input.

In one embodiment, the vocabulary text for a legal provision of the Marriage Law is: 'marriage', 'living together', 'two', 'implementation', 'home', 'family', 'violence', 'abuse', 'legacy', 'home', 'family', 'member', 'three', 'gambling', 'inhalation', 'bad habit', 'repeat', 'four', 'emotion', 'not', 'separate', 'full', 'two', 'year', 'five', 'result', 'couple', 'feeling', 'break', 'state', 'announcement', and 'missing'.

B. A judicial text is acquired and segmented with a word segmentation tool to obtain the vocabulary text. Existing word segmentation tools, such as jieba, THULAC from Tsinghua University, hanltp of Haohang, funltp and the like, offer similar segmentation functionality; each consists of a default vocabulary and a fast segmentation algorithm, so common words and general professional terms can be segmented successfully.

In one embodiment, a judicial text is obtained in which a legal provision of the Marriage Law reads: "(1) bigamy, or a person with a spouse living together with another; (2) committing family violence, or abusing or abandoning family members; (3) bad habits such as gambling or drug taking that persist despite repeated admonition; (4) living apart for two full years because of emotional discord; (5) other circumstances that lead to the rupture of the couple's affection. Where one party is declared missing and the other party files for divorce, the divorce shall be granted."

Segmenting this judicial text with a word segmentation tool yields the following vocabulary text for the legal provision: 'marriage', 'living together', 'two', 'implementation', 'home', 'family', 'violence', 'abuse', 'legacy', 'home', 'family', 'member', 'three', 'gambling', 'inhalation', 'bad habit', 'repeat', 'four', 'emotion', 'not', 'separate', 'full', 'two', 'year', 'five', 'result', 'couple', 'feeling', 'break', 'state', 'announcement', and 'missing'.

Existing word segmentation tools cannot correctly delimit highly specialized legal terms, such as 'person with limited capacity for civil conduct' or 'disease that makes a person unfit for marriage'. To segment such terms correctly, a custom legal vocabulary is used.

C. And constructing a judicial vocabulary table, adding the judicial vocabulary table into a user-defined dictionary of the word segmentation tool, replacing a default vocabulary table in the word segmentation tool, and segmenting the judicial text to obtain the vocabulary text. The judicial vocabulary construction method comprises the following steps:

c.1 Add entries from legal dictionaries, legal specialty word banks, etc. to the vocabulary;

c.2 Using a combined word frequency statistical algorithm to combine the conventional words to form a new vocabulary, adding the new vocabulary with the combined word frequency exceeding a set threshold into a vocabulary table, wherein the combined word frequency refers to the frequency of more than two words appearing simultaneously;

c.3 Adding the vocabulary into a self-defined dictionary of a word segmentation tool, replacing a default vocabulary in the word segmentation tool, segmenting the judicial text to obtain vocabulary text, manually rechecking the vocabulary text, checking the segmentation result one by one and checking the word frequency statistics of the word segmentation result, and supplementing the specialized vocabulary which is not segmented correctly into the vocabulary;

c.4 The reviewed vocabulary is used as the judicial vocabulary.
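Step c.2 can be illustrated with a short sketch that counts adjacent word pairs and proposes frequent pairs as new vocabulary entries. Restricting combinations to adjacent pairs, the `>=` comparison, and the toy Chinese tokens are assumptions; the text only requires a combined word frequency and a set threshold.

```python
from collections import Counter


def discover_new_words(segmented_texts, threshold=2, joiner=""):
    """Propose new words from adjacent pairs whose combined frequency meets the threshold."""
    pair_counts = Counter()
    for tokens in segmented_texts:
        pair_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return sorted(
        joiner.join(pair) for pair, count in pair_counts.items() if count >= threshold
    )


# Toy segmented texts: "家庭" + "暴力" co-occur twice and become a candidate term.
segmented_texts = [
    ["家庭", "暴力"],
    ["实施", "家庭", "暴力"],
    ["家庭", "成员"],
]
new_words = discover_new_words(segmented_texts, threshold=2)
```

Candidates produced this way would still go through the manual review of step c.3 before entering the judicial vocabulary.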

In one embodiment, the judicial text is segmented using the judicial vocabulary to obtain the vocabulary text. For the same legal provision, the result is: 'bigamy', 'person with a spouse', 'living with another', 'two', 'implementation', 'family violence', 'abuse', 'abandoned family member', 'three', 'gambling', 'drug taking', 'bad habit', 'repeated admonition', 'four', 'emotional discord', 'living apart', 'full', 'two', 'year', 'five', 'cause', 'couple', 'emotional rupture', 'situation', 'one party', 'announcement', 'missing', 'other party', 'propose', 'divorce', 'litigation', 'should', 'grant', 'divorce', etc.

Compared with the method of directly utilizing the word segmentation tool, the method of segmenting the judicial texts by using the judicial vocabulary can correctly segment legal professional words such as 'family violence', 'emotional rupture' and the like. And combining legal vocabularies from various sources to construct a judicial vocabulary table, so that the word segmentation precision of the legal text is improved, and a high-precision word segmentation result is the basis of subsequent text processing.
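The effect of adding judicial terms to the segmenter's vocabulary can be illustrated with a toy forward maximum matching segmenter. This is only a stand-in: real tools such as jieba use more sophisticated algorithms and their own custom-dictionary mechanisms, and the two small vocabularies below are invented for the example.

```python
def max_match(text, vocabulary):
    """Greedy forward maximum matching: take the longest vocabulary word at each position."""
    max_len = max(map(len, vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


default_vocab = {"实施", "家庭", "暴力"}
judicial_vocab = default_vocab | {"家庭暴力"}  # custom legal term added

before = max_match("实施家庭暴力", default_vocab)
after = max_match("实施家庭暴力", judicial_vocab)
```

With the default vocabulary the term for family violence is split into two general words; once it is in the vocabulary it is kept as a single token.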

Furthermore, the part of speech of the vocabulary text is checked, nouns, verbs and adjectives are reserved, and other vocabularies are removed.

3. And selecting candidate labels according to the word frequency and/or the combined word frequency of the vocabulary text to obtain a primary label system. Word frequency refers to the frequency or number of occurrences of a single word; the combined word frequency refers to the frequency or the frequency of the simultaneous occurrence of more than two words. One or more of the following may be used.

a) Counting the word frequency of a single word in the word text, and adding the word as a candidate tag into a primary tag system when the word frequency is greater than a set threshold value until all words are counted;

b) Taking each pair of adjacent words as a combination, counting the combined word frequency in the vocabulary text, sorting from high to low, and adding the combined words ranked within a set number of top positions to the primary label system as new words;

c) The method comprises the steps of defining window length K by using a window co-occurrence method, counting the occurrence frequency of any M vocabulary combinations by using a window traversal method, taking the vocabulary in the N combinations with the highest occurrence frequency as a keyword, counting the word frequency of a single vocabulary in the keyword, and adding the vocabulary with the word frequency exceeding a set threshold value into a primary label system as a candidate label.
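Option c) can be sketched as follows. The parameter names mirror K, M and N from the text; treating each window as a set of distinct words, using `>=` for the threshold, and the toy English tokens are assumptions of this sketch rather than details given in the source.

```python
from collections import Counter
from itertools import combinations


def candidate_tags(tokens, K=4, M=2, N=3, min_freq=2):
    """Window co-occurrence: keep words from the N most frequent M-word combinations."""
    combo_counts = Counter()
    for start in range(max(1, len(tokens) - K + 1)):
        window = sorted(set(tokens[start:start + K]))  # distinct words in the window
        combo_counts.update(combinations(window, M))
    keywords = {w for combo, _ in combo_counts.most_common(N) for w in combo}
    freq = Counter(t for t in tokens if t in keywords)  # single-word frequency
    return sorted(w for w in keywords if freq[w] >= min_freq)


tokens = ["family", "violence", "abuse", "family", "violence",
          "gambling", "family", "violence"]
tags = candidate_tags(tokens, K=3, M=2, N=1, min_freq=2)
```

In this toy input the pair ("family", "violence") co-occurs in every window, so both words become candidate labels.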

Further, using regularization to screen the labels in the primary label system, namely the words in the primary label system, and eliminating non-universal words and non-label words, wherein the non-universal words are words in a preset non-universal vocabulary table, such as names; the non-tagged vocabulary is a vocabulary in a preset non-tagged vocabulary, such as an isolated verb.

Because different laws address different matters and law is a specialized field, the same object plays different roles under different laws; for example, a 'car' is property under marriage law, but represents the legal subject 'motor vehicle' under traffic law. Therefore, different laws use different label dictionaries, and the label dictionaries of multiple laws together form a label system.

The method comprises the steps of establishing a primary label system by using an automatic keyword extraction and part-of-speech tagging method, establishing different label dictionaries for different laws based on a layered thought, and establishing the label system, so that cross interference among the laws can be effectively eliminated.

4. And combining and/or expanding the labels according to the similarity of the labels in the primary label system to obtain an expanded label system. The similarity of the labels may be calculated in one or more of the following manners.

In one embodiment, a character-based label similarity calculation method is used. Let W1 and W2 denote two labels, $W1 = \{w_{11}, w_{12}, \dots, w_{1e_1}\}$ and $W2 = \{w_{21}, w_{22}, \dots, w_{2e_2}\}$, where $e_1$ and $e_2$ are the numbers of characters contained in labels W1 and W2; $w_{11}$, $w_{12}$ and $w_{1e_1}$ are the 1st, 2nd and $e_1$-th characters of label W1, and $w_{21}$, $w_{22}$ and $w_{2e_2}$ are the 1st, 2nd and $e_2$-th characters of label W2.

Similarity sim (W1, W2) = the number of identical characters in label W1 and label W2 / the larger of the character lengths of label W1 and label W2.

For example, if label 1 and label 2 are two different two-character words that both mean 'couple' and share only the character for 'husband', the character lengths are 2 and 2, the number of identical characters is 1, and the similarity of the labels is 1/2 = 0.5.
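The character-based similarity can be written in a few lines. Counting shared characters with multiplicity is an assumption (the text does not say how repeated characters are handled), and the two Chinese words used in the example are assumed illustrations of two 'couple' synonyms that share one character.

```python
from collections import Counter


def char_similarity(w1, w2):
    """sim(W1, W2) = shared characters / larger character length."""
    if not w1 or not w2:
        return 0.0
    shared = sum((Counter(w1) & Counter(w2)).values())  # multiset intersection
    return shared / max(len(w1), len(w2))


similarity = char_similarity("夫妻", "夫妇")  # one shared character out of two
```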

In one embodiment, a semantic-based label similarity calculation method is adopted: a semantic model is constructed using language models such as Word2Vec or GloVe; a large number of judicial texts of various types are acquired as a corpus to train the semantic model; the two labels are input into the semantic model to obtain their correlation score (W1, W2); and the correlation of the two labels is taken as the similarity of the labels.

For example, two groups of words ('brother' ) and ('brother', 'motor vehicle'), the first group of words is clearly more relevant than the second group after training of the semantic model.

In one embodiment, a label similarity calculation method based on both characters and semantics is adopted: a character-based label similarity weight p and a semantic-based label similarity weight q are set, the character-based similarity sim (W1, W2) and the semantic-based similarity score (W1, W2) of labels W1 and W2 are obtained, and the overall similarity of the labels is calculated as p × sim (W1, W2) + q × score (W1, W2).
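A minimal sketch of the weighted combination follows. The semantic score is taken here as the cosine similarity of word vectors, which is a common choice with Word2Vec-style models but is an assumption; the hand-written two-dimensional vectors stand in for a model trained on a judicial corpus.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_similarity(w1, w2, vectors, p=0.5, q=0.5):
    """p * sim(W1, W2) + q * score(W1, W2)."""
    shared = sum((Counter(w1) & Counter(w2)).values())
    sim = shared / max(len(w1), len(w2))          # character-based part
    score = cosine(vectors[w1], vectors[w2])      # semantic part (assumed cosine)
    return p * sim + q * score


# Toy vectors standing in for a trained semantic model.
vectors = {"夫妻": [1.0, 0.0], "夫妇": [1.0, 0.0], "机动车": [0.0, 1.0]}
close = combined_similarity("夫妻", "夫妇", vectors)
far = combined_similarity("夫妻", "机动车", vectors)
```

With p = q = 0.5 the two synonyms score 0.5 × 0.5 + 0.5 × 1.0 = 0.75, while the unrelated pair scores 0.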

The primary label system is a relatively simple vocabulary list, and some vocabularies in the list may have similar semanteme and need to be merged. In addition, the vocabulary in the table cannot be effectively compatible with the semantic diversity in the actual life, and needs to be expanded.

Merging and/or expanding tags to obtain an expanded tag system may be performed in one or more of the following ways.

In one embodiment, when the similarity of two labels exceeds threshold III, or ranks among the top R of all label similarity values in the primary label system, the two labels are merged: one is retained and the other is removed from the primary label system. When the similarity between words in the semantic model or the synonym dictionary and a label word meets a set threshold IV, those words are taken as expansion words of the label word and added to the primary label system.

For example, suppose the semantic model or the synonym dictionary contains two words, 'couple' and 'object', and the label word in the primary label system is 'couple' (a near-synonym of the first word in the original language). The similarity between each word and the label word is calculated and compared against threshold IV; here the word 'couple' meets the condition and is taken as an expansion word of the label 'couple'.
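Merging and expansion can be sketched with two small helpers. The function names, the lookup-table similarity, and the choice to return expansion words as a per-label mapping (in the spirit of a label dictionary) are assumptions for illustration; only the threshold-based rule itself comes from the text.

```python
def merge_labels(labels, similarity, threshold_iii):
    """Keep a label only if it is not too similar to any label already kept."""
    kept = []
    for label in labels:
        if all(similarity(label, k) < threshold_iii for k in kept):
            kept.append(label)
    return kept


def expand_labels(labels, candidates, similarity, threshold_iv):
    """Attach candidate words whose similarity to a label meets threshold IV."""
    return {
        label: [w for w in candidates
                if w not in labels and similarity(w, label) >= threshold_iv]
        for label in labels
    }


# Hypothetical similarity scores standing in for the real calculation.
SCORES = {
    frozenset(("couple", "spouses")): 0.9,
    frozenset(("couple", "partners")): 0.8,
    frozenset(("couple", "object")): 0.2,
}


def toy_similarity(a, b):
    return 1.0 if a == b else SCORES.get(frozenset((a, b)), 0.0)


merged = merge_labels(["couple", "spouses", "car"], toy_similarity, 0.85)
expanded = expand_labels(merged, ["partners", "object"], toy_similarity, 0.7)
```

Here 'spouses' is merged into 'couple', and 'partners' becomes an expansion word of 'couple' while 'object' falls below the threshold.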

By tag expansion, for example, the following table is formed. The table is used for eliminating ambiguity, different expressions with the same semantic meaning are unified into the same word, and text normalization is completed.

Table 1 marriage class tag dictionary example

Table 2 example of a traffic class label dictionary

In one embodiment, expansion words corresponding to the words in the primary label system are extracted from the expansion label dictionary and added to the primary label system; then, when the similarity of two labels in the primary label system exceeds threshold III, or ranks among the top R of all label similarity values in the primary label system, the two labels are merged: one is retained and the other is removed from the primary label system.

5. And determining that the final label system is constructed according to the accuracy of the text searched by the extended label system.

The basic use of the text label system is text search. By contrasting the differences in search accuracy for different versions of a tag system, the utility of the tag system can be verified.

In one embodiment, a method for calculating accuracy of searching a text is provided.

5.1 Obtaining judicial texts and extracting the text of the case-related and legal-provision-related fields; selecting candidate labels according to the word frequency and/or the combined word frequency of the case vocabulary text and the legal-provision vocabulary text to obtain a primary label system; and merging and/or expanding the labels according to the similarity of the labels in the primary label system to obtain an expanded label system.

5.2 A test set is created that includes a sample set and a search object set. Each sample of the sample set comprises a question, the n cases most relevant to the question and the m legal provisions most relevant to the question. The search object set includes all cases and all applicable legal provisions.

For example, one question in the sample set is 'An accident occurred while driving and a non-motor vehicle damaged the taillight; is compensation due?', together with the 3 cases and the 6 legal provisions most relevant to that question.

5.3 The text labels of the questions, cases and legal provisions in the sample set are extracted to form label vectors.

5.4 Cases and applicable legal provisions in the search object set that are similar to the question are recommended by vector matching. Vector similarity is calculated with the Euclidean distance, the most commonly used vector distance: the two vectors are subtracted and the norm (modulus) of the difference is taken as the distance.

5.5 Accuracy is calculated by comparing the recommended cases and legal provisions with the cases and legal provisions that the sample set lists for the question. Accuracy is expressed as the average of the recall rate and the precision rate, where recall rate = number of correct samples found / number of all correct samples in the data set, and precision rate (also called the accuracy rate) = number of correct samples found / total number of samples found.

For example, if 5 results are recommended and 2 of them are correct, the recall rate is 40%; if the test set has 10 samples and the recommended results for 5 of them match the true values, the accuracy rate is 50%.
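Steps 5.3 to 5.5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how a label vector is encoded, so the binary encoding in `tag_vector` is an assumption, and all function names, case identifiers and labels are hypothetical. Recall and precision follow the two formulas given in step 5.5.

```python
import math

def tag_vector(tags, label_system):
    """Assumed encoding: 1.0 where the text carries the label, else 0.0."""
    tag_set = set(tags)
    return [1.0 if label in tag_set else 0.0 for label in label_system]

def euclidean_distance(u, v):
    """Norm of the difference of two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recommend(question_vec, object_vecs, k):
    """Return the ids of the k search objects closest to the question."""
    ranked = sorted(
        object_vecs,
        key=lambda oid: euclidean_distance(question_vec, object_vecs[oid]),
    )
    return ranked[:k]

def recall_precision(recommended, relevant):
    """Recall = hits / all correct items; precision = hits / items returned."""
    hits = len(set(recommended) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(recommended) if recommended else 0.0
    return recall, precision

# Toy search object set (hypothetical labels and case ids).
label_system = ["traffic accident", "non-motor vehicle", "compensation", "divorce"]
question = tag_vector(
    ["traffic accident", "non-motor vehicle", "compensation"], label_system
)
objects = {
    "case_1": tag_vector(["traffic accident", "compensation"], label_system),
    "case_2": tag_vector(["divorce"], label_system),
    "case_3": tag_vector(
        ["traffic accident", "non-motor vehicle", "compensation"], label_system
    ),
}

recommended = recommend(question, objects, k=2)
recall, precision = recall_precision(recommended, relevant=["case_3"])
accuracy = (recall + precision) / 2  # average of recall and precision
```

In this toy run the only relevant case is retrieved, so recall is 1.0, while one of the two recommendations is irrelevant, so precision is 0.5 and the averaged accuracy is 0.75.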

Table 3 Example of a search object set

Case 1 label	Legal provisions applicable to case 1	Article 1 of XX Law	Label 1
Case 2 label	Legal provisions applicable to case 2	Article 2 of XX Law	Label 2
…	…	…	…
Case N label	Legal provisions applicable to case N	Article 1 of another law	Label N

Table 4 Example of search results versus true values

In one embodiment, the accuracy of searching a text is calculated as follows.

Presetting a sample set and a search object set, wherein the sample set $SS$ comprises $NC$ samples, each sample $S_i$ includes a search question $Q_i$ and a vocabulary text set $X_i$ related to the search question, and the vocabulary text set $X_i$ comprises $H_i$ vocabulary texts, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iH_i}\}$; the search object set $Y$ includes $NS$ vocabulary texts, $Y = \{y_1, y_2, \ldots, y_{NS}\}$;

Obtaining the extension tags $Z$ of the search object set $Y$ by using the extended label system, $Z = \{z_1, z_2, \ldots, z_{NS}\}$;

Sequentially extracting a sample $S_i$ from the sample set and obtaining the tag vector $T_i$ of the search question $Q_i$;

Further, when the accuracy of the searched text is greater than the threshold value V, the current expanded label system is the final label system, otherwise, the label system is optimized.
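The per-sample accuracy used in this comparison is spelled out in the method claims: the $H_i$ search objects whose extension tags are most similar to the tag vector $T_i$ form a control group, and the single search accuracy is the number of texts shared by the control group and $X_i$, divided by $H_i$. A minimal Python sketch follows. The similarity measure is not fixed by the text, so the Jaccard overlap of tag sets used here is an assumption, as are the function names and the toy data.

```python
def jaccard(a, b):
    """Assumed similarity: overlap of two tag sets (not specified by the source)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def single_search_accuracy(question_tags, relevant_ids, object_tags):
    """Accuracy for one sample S_i.

    question_tags: tags of the search question Q_i (stands in for T_i)
    relevant_ids:  ids of the H_i texts in X_i
    object_tags:   mapping id -> extension tags z_j of each search object y_j
    """
    h = len(relevant_ids)
    if h == 0:
        raise ValueError("the sample must list at least one relevant text")
    # Rank all search objects by similarity to the question, best first.
    ranked = sorted(
        object_tags,
        key=lambda oid: jaccard(question_tags, object_tags[oid]),
        reverse=True,
    )
    control_group = ranked[:h]  # the H_i most similar search objects
    return len(set(control_group) & set(relevant_ids)) / h

# Toy sample (hypothetical ids and tags).
object_tags = {
    "y1": {"traffic accident", "compensation"},
    "y2": {"divorce", "custody"},
    "y3": {"traffic accident", "insurance"},
    "y4": {"inheritance"},
}
score = single_search_accuracy(
    {"traffic accident", "compensation"}, ["y1", "y4"], object_tags
)
```

Here the control group is `["y1", "y3"]`, of which only `y1` belongs to the relevant set, so the single search accuracy is 0.5.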

The label system can be optimized by one or a combination of the following methods:

1) Adjust the values of thresholds I, II, III and IV and update the extended label system until the search accuracy of the current extended label system is greater than threshold V, yielding the final label system.

2) Adjust the legal vocabulary, the label similarity calculation method and the search accuracy calculation method, and update the extended label system until the search accuracy of the current extended label system is greater than threshold V, yielding the final label system.

3) Taking the current extended label system as the object, calculate the search accuracy after removing a given label; if the accuracy is unchanged or higher than the accuracy obtained before the removal, remove that label from the extended label system. Traverse all labels to obtain the final label system.
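Method 3) amounts to a leave-one-out pruning pass over the labels, which can be sketched in Python as follows. The function name `prune_labels` and the `accuracy_of` callback are hypothetical, and the toy scoring function merely stands in for the search accuracy measured on a test set. Updating the baseline after each accepted removal is one reading of "the accuracy obtained before the removal" and is an assumption.

```python
def prune_labels(labels, accuracy_of):
    """Remove every label whose removal leaves accuracy unchanged or higher.

    accuracy_of: hypothetical callback mapping a list of labels to the
    search accuracy obtained with that label system.
    """
    current = list(labels)
    baseline = accuracy_of(current)
    for label in list(labels):  # traverse all labels once
        trial = [l for l in current if l != label]
        score = accuracy_of(trial)
        if score >= baseline:
            # Removal did not hurt: accept it and update the baseline.
            current, baseline = trial, score
    return current

# Toy stand-in for the test-set accuracy (illustration only).
def toy_accuracy(label_list):
    useful = {"traffic accident", "compensation"}
    score = len(useful & set(label_list)) / len(useful)
    if "hereby" in label_list:
        score -= 0.1  # a noisy label slightly hurts retrieval
    return score

final_labels = prune_labels(
    ["traffic accident", "hereby", "compensation", "court"], toy_accuracy
)
```

In this toy run the noisy label `hereby` and the neutral label `court` are removed, while the two labels that drive accuracy are kept.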

The second embodiment provides a judicial text label system construction system, which comprises a legal vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, an extension label module, a verification label module and an optimization label module, wherein,

the legal vocabulary module stores a legal vocabulary which comprises professional vocabularies related to judicial law;

the word segmentation module is used for adding the legal vocabulary into the general word segmentation tool and segmenting the judicial texts provided by the data acquisition module to obtain the judicial vocabulary texts;

the expansion tag module is used for storing an expansion tag dictionary, counting the similarity of tags in the primary tag system, combining the tags meeting a set threshold value III, extracting a corresponding expansion vocabulary from the expansion tag dictionary, adding the expansion vocabulary into the primary tag system and obtaining an expansion tag system;

the optimization label module judges whether the accuracy provided by the verification label module meets a set threshold V, and if so, the current label system is the final label system; if not, it adjusts the set threshold II in the primary label construction module and the set thresholds III and IV in the extension label module.

Referring to fig. 1, a data processing flow of a judicial text label system construction system is as follows:

About 160,000 judicial texts of civil judgment documents from the last 10 years are collected, including marriage judgment documents and traffic judgment documents, and the data are preprocessed as follows: judicial text data whose case-details field or applicable-law field is empty are removed; judicial text data whose case-details text is shorter than a set case-details length threshold are removed; duplicate judicial text data are removed; and the texts of the case-details, applicable-law and other specified fields are extracted separately from each judicial text. In addition, 170 commonly used civil laws are collected, and the texts of two specified fields, the legal article and its specific provisions, are extracted.
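The preprocessing filters described above can be sketched in Python as follows. The record layout is not given in the text, so the field names `case_details` and `applicable_law`, the function name `preprocess`, exact-match duplicate detection and the sample records are all assumptions made for illustration.

```python
def preprocess(records, min_case_length):
    """Drop records with empty fields, too-short case details, or duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        case = (record.get("case_details") or "").strip()
        law = (record.get("applicable_law") or "").strip()
        if not case or not law:
            continue  # empty case-details or applicable-law field
        if len(case) < min_case_length:
            continue  # case details shorter than the set threshold
        key = (case, law)
        if key in seen:
            continue  # assumption: duplicates are exact repeats of both fields
        seen.add(key)
        cleaned.append({"case_details": case, "applicable_law": law})
    return cleaned

# Hypothetical records for illustration only.
records = [
    {"case_details": "The defendant's car struck a non-motor vehicle at a crossing.",
     "applicable_law": "XX Law, Article 1"},
    {"case_details": "The parties dispute custody of their child after divorce.",
     "applicable_law": ""},
    {"case_details": "Too short.", "applicable_law": "XX Law, Article 2"},
    {"case_details": "The defendant's car struck a non-motor vehicle at a crossing.",
     "applicable_law": "XX Law, Article 1"},
]
cleaned = preprocess(records, min_case_length=20)
```

Only the first record survives: the second has an empty applicable-law field, the third is below the length threshold, and the fourth repeats the first.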

Word segmentation: the word segmentation module adds the legal vocabulary to a general word segmentation tool and segments the preprocessed judicial texts to obtain the judicial vocabulary texts.

Primary label construction: vocabulary whose word frequency meets a set threshold is extracted as primary labels.

Label expansion: the corresponding expanded vocabulary is extracted from the expanded label dictionary.

Label verification: verification is performed through the correspondence between cases and legal provisions, comparing the search accuracy of different versions of the expanded labels.

Label optimization: whether the verified labels meet the requirement is judged; if so, construction of the label system is complete; if not, the result is fed back to the verification label module.

While the present application has been described by way of embodiments, those of ordinary skill in the art will appreciate that it admits numerous variations and changes that do not depart from its spirit, and the appended claims are intended to cover such variations and changes.

Claims

1. A judicial text label system construction method is characterized by comprising the following steps:

and determining whether construction of the final label system is complete according to the search accuracy of the extended label system, wherein the search accuracy of the extended label system is calculated as follows: a sample set and a search object set are established, the sample set comprising a plurality of question labels and a set X of judicial vocabulary texts related to the questions, and the search object set comprising a set Y of a plurality of judicial vocabulary texts; the labels of set Y are obtained by using the extended label system; the question labels are extracted from the sample set; and the degree to which the vocabulary texts searched out of set Y with the question labels agree with the vocabulary texts in set X is counted.

2. The method of claim 1, wherein the obtaining of the lexical text comprises: constructing a judicial vocabulary, adding the judicial vocabulary into a custom dictionary of a word segmentation tool, and segmenting a judicial text to obtain a vocabulary text; wherein, the constructing the judicial vocabulary comprises:

adding the vocabulary of the legal dictionary and the vocabulary of the legal professional lexicon into a prepared vocabulary list;

counting the combined word frequency of the conventional words, and adding the conventional word combination with the combined word frequency meeting a set threshold I into the prepared vocabulary list as a new word;

and obtaining the judicial vocabulary.

3. The method for constructing a judicial text label system according to claim 1, wherein selecting candidate labels according to the vocabulary text word frequency and the combined word frequency to obtain a primary label system comprises:

defining the window length K, counting the occurrence times of any M vocabulary combinations by using a window traversal method, taking the vocabulary in the N combinations with the highest occurrence times as a keyword, counting the word frequency of a single vocabulary in the keyword, taking the vocabulary with the word frequency meeting a set threshold II as a candidate tag, and adding the candidate tag into the primary tag system.

4. The method for constructing a judicial text label system according to claim 1, wherein the similarity of labels is calculated by a method comprising:

obtaining a tag similarity score (W1, W2) of tags W1 and W2 based on semantics, wherein the score (W1, W2) is a correlation value of the tags W1 and W2, and the correlation value is obtained from a semantic model trained by using a judicial text as a corpus;

calculating the similarity of the tags as $p \cdot \mathrm{sim}(W1, W2) + q \cdot \mathrm{score}(W1, W2)$.

5. The method for constructing a judicial text label system according to claim 1, wherein:

the merging of the labels is specifically: when the similarity of two labels meets a set threshold III, or ranks among the top R label similarity values of the primary label system, the two labels are merged, one of them being retained and the other removed from the primary label system;

the expanding of the labels is specifically: when the similarity between a word and a label word in a semantic model or a synonym dictionary meets a set threshold IV, the word is taken as an expansion word of the label word and added to the primary label system.

6. The method for constructing a judicial text label system according to claim 1, wherein: the accuracy of the search text is calculated by the following method:

establishing a test set, wherein the test set comprises a sample set and a search object set, each sample of the sample set comprises a question, n cases most relevant to the question and m legal rules most relevant to the question, and the search object set comprises all case and legal rule sets;

extracting text labels of problems, cases and statutes in a sample set to form label vectors;

calculating accuracy by comparing the recommended cases and legal provisions with the cases and legal provisions corresponding to the sample set, wherein the accuracy is represented by the average of the recall rate and the precision rate; recall rate = number of correct samples found / number of all correct samples in the data set; precision rate (also called the accuracy rate) = number of correct samples found / total number of samples found.

7. The method for constructing a judicial text label architecture according to claim 1, wherein the accuracy of the search text is calculated by:

presetting a sample set and a search object set, wherein the sample set $SS$ comprises $NC$ samples, each sample $S_i$ includes a search question $Q_i$ and a vocabulary text set $X_i$ related to the search question, said vocabulary text set $X_i$ comprises $H_i$ vocabulary texts, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iH_i}\}$; the search object set $Y$ comprises $NS$ vocabulary texts, $Y = \{y_1, y_2, \ldots, y_{NS}\}$;

obtaining the extension tags $Z$ of the search object set $Y$ by using the extended label system, $Z = \{z_1, z_2, \ldots, z_{NS}\}$;

computing the similarity between the tag vector $T_i$ and each extension tag $z_j$, and taking the vocabulary texts corresponding to the $H_i$ extension tags with the highest similarity to form a control group $T$;

calculating the single search accuracy = (number of vocabulary texts that the control group $T$ and the set $X_i$ have in common) $/ H_i$;

8. The method for constructing a judicial text label system according to claim 1, wherein determining whether construction of the final label system is complete according to the search accuracy of the extended label system comprises: when the accuracy of the searched text meets a set threshold V, the current extended label system is the final label system; otherwise, the values of thresholds I, II, III and IV are adjusted and the current extended label system is updated until the search accuracy of the updated extended label system meets the set threshold V, yielding the final label system; wherein threshold I is the combined word frequency that a conventional word combination must meet to be added to the prepared vocabulary list as a new word, threshold II is the word frequency that a word must meet to become a candidate label, threshold III is the similarity that labels must meet to be merged, and threshold IV is the similarity that a word must have with a label to become an expansion word of that label.

9. The method of claim 1, wherein determining that the final tag architecture is complete according to the accuracy of the extended tag architecture search text comprises: and when the accuracy of the searched text meets a set threshold value V, the current expansion label system is the final label system, otherwise, the accuracy of the searched text after the removal of a certain label is calculated, if the accuracy is unchanged or increased compared with the accuracy obtained before the removal of the label, the label is removed from the expansion label system, all labels are traversed, and the final label system is obtained.

10. A judicial text label system construction system comprises a legal vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, an expansion label module, a verification label module and an optimization label module, wherein,

the data acquisition module acquires a judicial text and performs preprocessing;

the word segmentation module adds the legal vocabulary into a general word segmentation tool, and segments the judicial texts provided by the data acquisition module to obtain the judicial vocabulary texts;

the primary label building module is used for obtaining the judicial vocabulary text provided by the word segmentation module, counting word frequency and combined word frequency, and extracting vocabularies and combined vocabularies of which the word frequency and the combined word frequency meet a set threshold value II to serve as a primary label system;

the extended tag module is used for storing an extended tag dictionary, counting the similarity of tags in the primary tag system, combining the tags meeting a set threshold value III, extracting a corresponding extended vocabulary from the extended tag dictionary, and adding the extended vocabulary into the primary tag system to obtain an extended tag system;

the verification label module is used for storing a sample set and a search object set, wherein the sample set comprises a plurality of problem labels and a judicial vocabulary text set X related to problems, the search object set comprises a plurality of judicial vocabulary text sets Y, the labels of the set Y are obtained by using the extended label system, the problem labels are extracted from the sample set, and the accuracy of the vocabulary texts in the set Y and the vocabulary texts in the set X searched by using the problem labels is counted;

the optimized tag module judges whether the accuracy provided by the verification tag module meets a set threshold value V, and if the accuracy meets the set threshold value V, the current tag system is a final tag system; if not, adjusting a set threshold II in the primary label building module, a set threshold III in the extension label module and a set threshold IV.