CN109543178B - Method and system for constructing judicial text label system

Info

Publication number
CN109543178B
CN109543178B CN201811294777.8A CN201811294777A CN109543178B CN 109543178 B CN109543178 B CN 109543178B CN 201811294777 A CN201811294777 A CN 201811294777A CN 109543178 B CN109543178 B CN 109543178B
Authority
CN
China
Prior art keywords
label
vocabulary
text
tag
judicial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811294777.8A
Other languages
Chinese (zh)
Other versions
CN109543178A (en)
Inventor
丁锴
李建元
陈涛
王开红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yinjiang Technology Co ltd
Original Assignee
Yinjiang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yinjiang Technology Co ltd
Priority to CN201811294777.8A
Publication of CN109543178A
Application granted
Publication of CN109543178B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Tourism & Hospitality (AREA)
  • Probability & Statistics with Applications (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method and a system for constructing a judicial text label system. A judicial vocabulary text is obtained with a word segmentation tool, and a primary label system is constructed from word frequency statistics. Labels with similar semantics in the primary label system are merged, and labels are expanded where needed, to obtain an expanded label system. A text test set is then used to measure the accuracy with which the expanded label system retrieves texts. If the accuracy is sufficient, construction of the label system is complete; otherwise, the label system is optimized further. The method builds a targeted label system for each law and greatly improves the search precision for judicial texts.

Description

Method and system for constructing judicial text label system
Technical Field
The application relates to the field of natural language processing, in particular to a method and a system for constructing a judicial text label system.
Background
As the legal field becomes more open and transparent, more and more official documents are placed under public supervision. According to statistics from China Judgments Online, more than 50 million documents are currently online, and the number grows by about 30,000 every day. However, the growth of legal text resources also brings a series of problems: storage requirements keep increasing, searches become slower, and search results often do not contain the desired information. These problems reduce the efficiency with which legal text resources can be used, so legal texts need to be processed. A common way to process massive Internet data is data tagging under the Vector Space Model: the data is processed into a series of keywords (terms) or tags, which are then used to generate an index. Legal text processing also uses this model; what differs is how the tags are defined.
There has been considerable work on text label extraction.
Patent CN201510697001 proposes mining notification-type short messages from existing short message texts by writing regular expressions, and using the mined XX as the identity label information of a short message text. For the mined notification-type short messages, the identity label information with the highest frequency, subject to a threshold, is taken as the final identity label information of the service number. The identity label can be updated in real time when a new short message arrives.
Patent CN201710541481 proposes a text label generation method that extracts keywords from a target text using a separate strategy for each label type, obtaining candidate labels of each type; the candidate labels are then cross-validated across label types, and the target labels of the text are determined from the validated candidates. Because labels are extracted separately for different label types (entity words, text segments and/or topics) and then cross-validated, label extraction accuracy improves, which addresses the low label extraction accuracy of the prior art.
Patent CN201711213971 proposes a method for generating text label words. Label words are first extracted from the text, and correlated grouped label words are generated from the extracted label words and a preset label word relation. The grouped label words are then aggregated according to the association relation among them, and the aggregated grouped label words that the text can completely cover are looked up in a preset label word dictionary to obtain combined label words. Finally, mapping label words for the text are generated from the combined label words and the preset label word relation. Corresponding label words can thus be generated for a text quickly and independently according to actual requirements, without the intervention of professionals.
Patent CN201510197328 proposes a text label extraction method that first predicts the text category, then predicts a topic through a topic clustering model, then extracts text keywords, and finally takes the target category, target topic, and target keywords as the labels of the text. The labels have different levels, so retrieval at different granularities is supported, and articles can be recommended at different granularities according to different labels.
Because legal texts contain many specialized terms and the disputed points of different cases overlap heavily, the text label extraction methods above cannot meet the accuracy requirement. A new label system is therefore proposed: a label dictionary is established through a series of rules, and it is verified and optimized using the correspondence between legal cases and legal provisions, thereby improving the search precision for legal texts.
Disclosure of Invention
The invention provides a method and a system for constructing a judicial text label system, addressing the many specialized terms in legal texts and the high degree of overlap among case dispute points. By combining the advantages of machine learning and manual review, the method can significantly improve the accuracy of legal text retrieval while reducing manual intervention.
A method for constructing a judicial text label system is characterized by comprising the following steps:
acquiring a vocabulary text, wherein a vocabulary text is a text represented as a sequence of words;
selecting candidate labels according to the word frequency and/or the combined word frequency of the vocabulary text to obtain a primary label system;
merging and/or expanding labels according to the similarity of the labels in the primary label system to obtain an expanded label system;
and determining, according to the accuracy with which the expanded label system retrieves texts, that construction of the final label system is complete.
Further, acquiring the vocabulary text comprises: constructing a judicial vocabulary, adding the judicial vocabulary to the custom dictionary of a word segmentation tool, and segmenting a judicial text to obtain the vocabulary text;
wherein constructing the judicial vocabulary comprises:
adding entries from legal dictionaries, legal professional lexicons and the like to a preliminary vocabulary;
counting the combined word frequency of conventional words, and adding each combination of conventional words whose combined word frequency meets a set threshold I to the preliminary vocabulary as a new word;
rechecking, namely adding professional terms that were not segmented correctly to the preliminary vocabulary;
and obtaining the judicial vocabulary.
Further, selecting candidate labels according to the word frequency and the combined word frequency of the vocabulary text to obtain a primary label system comprises:
defining a window length K; counting, by window traversal, the number of occurrences of every combination of M words; taking the words in the N most frequent combinations as keywords; counting the word frequency of each individual keyword; and adding each word whose word frequency meets a set threshold II to the primary label system as a candidate label.
Further, the similarity of the labels is calculated as follows:
setting a character-based label similarity weight p and a semantic-based label similarity weight q;
obtaining the character-based similarity sim(W1, W2) of labels W1 and W2, wherein sim(W1, W2) = (the number of identical characters in label W1 and label W2) / (the larger of the character lengths of label W1 and label W2);
obtaining the semantic-based similarity score(W1, W2) of labels W1 and W2, wherein score(W1, W2) is the correlation value of label W1 and label W2, obtained from a semantic model trained with judicial texts as the corpus;
calculating the similarity of the labels = p × sim(W1, W2) + q × score(W1, W2).
Further,
merging labels specifically comprises: when the similarity of two labels meets a set threshold III, or the similarity of the two labels ranks among the top R label similarity values of the primary label system, merging the two labels by retaining one of them and removing the other from the primary label system;
and expanding labels specifically comprises: when the similarity between words in the semantic model or a synonym dictionary and a label word meets a set threshold IV, taking those words as extension words of the label word and adding them to the primary label system.
Further, the accuracy of the search text is calculated by:
establishing a test set, wherein the test set comprises a sample set and a search object set; the sample set comprises a question, the n cases most relevant to the question, and the m legal provisions most relevant to the question; and the search object set comprises all cases and legal provisions;
extracting the text labels of the questions, cases and legal provisions in the sample set to form label vectors;
recommending, from the search object set, cases similar to the question and applicable legal provisions by vector matching, wherein the vector similarity is calculated using the Euclidean distance;
calculating the accuracy by comparing the recommended cases and legal provisions with the cases and legal provisions in the sample set, wherein the accuracy is expressed as the average of recall and precision; recall = (number of correctly retrieved samples) / (number of correct samples in the data set); and precision = (number of correctly retrieved samples) / (number of retrieved samples).
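The evaluation described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes the intended distance is the ordinary Euclidean distance, and the function names (`euclidean`, `recommend`, `search_accuracy`), the toy label vectors, and the case identifiers are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length label vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recommend(query_vec, candidates, k):
    """Return the ids of the k candidates closest to the query vector."""
    ranked = sorted(candidates, key=lambda cid: euclidean(query_vec, candidates[cid]))
    return ranked[:k]

def search_accuracy(recommended, relevant):
    """Mean of precision and recall for a single question."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return (precision + recall) / 2

# Toy label vectors over a hypothetical three-label vocabulary.
candidates = {
    "case_a": [1, 1, 0],
    "case_b": [1, 0, 0],
    "case_c": [0, 0, 1],
}
question = [1, 1, 0]
top2 = recommend(question, candidates, k=2)
accuracy = search_accuracy(top2, relevant=["case_a", "case_c"])
```

Here one of the two recommended cases is relevant and one of the two relevant cases is recommended, so precision and recall are both 0.5.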
Further, the accuracy of the search text is calculated by:
presetting a sample set and a search object set, wherein the sample set $SS$ comprises $NC$ samples, each sample $S_i$ comprising a search question $Q_i$ and a set $X_i$ of vocabulary texts related to that search question, the set $X_i$ comprising $H_i$ vocabulary texts, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iH_i}\}$; and the search object set $Y$ comprises $NS$ vocabulary texts, $Y = \{y_1, y_2, \ldots, y_{NS}\}$;
obtaining the extension labels $Z$ of the search object set $Y$ by using the expanded label system, $Z = \{z_1, z_2, \ldots, z_{NS}\}$;
sequentially extracting a sample $S_i$ from the sample set and obtaining the tag vector $T_i$ of its search question $Q_i$;
computing the similarity between the tag vector $T_i$ and each extension label $z_j$, and taking the vocabulary texts corresponding to the $H_i$ extension labels with the highest similarity to form a comparison group $T$;
calculating the single-search accuracy = (number of vocabulary texts in the comparison group $T$ that also belong to the set $X_i$) / $H_i$;
and traversing the whole sample set and taking the average of the single-search accuracies as the accuracy of the search text.
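The averaging procedure above can be sketched as follows. The text does not fix how the similarity between a tag vector and an extension label is measured, so this sketch assumes Jaccard similarity over label sets; the function names and the toy data (`y1`, `y2`, `y3`) are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two label sets (an assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def single_search_accuracy(query_tags, related_ids, object_tags):
    """Share of the top-H_i retrieved texts that belong to the related set X_i."""
    h = len(related_ids)
    ranked = sorted(object_tags,
                    key=lambda oid: jaccard(query_tags, object_tags[oid]),
                    reverse=True)
    comparison_group = ranked[:h]
    return len(set(comparison_group) & set(related_ids)) / h

def average_accuracy(samples, object_tags):
    """Mean single-search accuracy over all (question tags, related ids) samples."""
    scores = [single_search_accuracy(q, x, object_tags) for q, x in samples]
    return sum(scores) / len(scores)

# Extension labels Z of a toy search object set Y.
object_tags = {
    "y1": {"divorce", "family violence"},
    "y2": {"motor vehicle", "collision"},
    "y3": {"divorce", "property"},
}
samples = [
    ({"divorce", "family violence"}, ["y1", "y3"]),  # both related texts retrieved
    ({"motor vehicle"}, ["y3"]),                     # retrieves y2, a miss
]
avg = average_accuracy(samples, object_tags)
```

The first sample scores 1.0 and the second 0.0, giving an average of 0.5.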
Further, determining that construction of the final label system is complete according to the accuracy with which the expanded label system retrieves texts comprises:
when the accuracy of the search text meets a set threshold V, taking the current expanded label system as the final label system; otherwise, adjusting the values of thresholds I, II, III and IV and updating the current expanded label system until the accuracy of the search text for the updated expanded label system meets the set threshold V, thereby obtaining the final label system.
Further, determining that construction of the final label system is complete according to the accuracy with which the expanded label system retrieves texts comprises: when the accuracy of the search text meets a set threshold V, taking the current expanded label system as the final label system; otherwise, calculating the accuracy of the search text after a given label is removed and, if the accuracy is unchanged or higher than before the removal, removing that label from the expanded label system; and traversing all labels to obtain the final label system.
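The label-removal variant can be sketched as a simple greedy loop. This is an illustrative sketch only: `evaluate` stands in for whatever routine measures search accuracy for a given label set, and the toy scorer and label names below are hypothetical.

```python
def prune_labels(labels, evaluate):
    """Remove each label whose removal leaves accuracy unchanged or higher."""
    kept = list(labels)
    for label in list(labels):
        baseline = evaluate(kept)                     # accuracy before removal
        trial = [l for l in kept if l != label]
        if trial and evaluate(trial) >= baseline:     # unchanged or improved
            kept = trial
    return kept

# Hypothetical scorer: accuracy depends only on how many useful labels remain.
USEFUL = {"divorce", "custody"}

def toy_accuracy(labels):
    return len(set(labels) & USEFUL) / len(USEFUL)

final_labels = prune_labels(["divorce", "weather", "custody", "thing"], toy_accuracy)
```

With this scorer, removing "weather" or "thing" leaves the accuracy unchanged, so both are dropped, while "divorce" and "custody" are kept because removing either lowers the accuracy.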
A judicial text label system construction system comprises a legal vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, an expansion label module, a verification label module and an optimization label module, wherein,
the legal vocabulary module stores a legal vocabulary which comprises professional vocabularies related to judicial;
the data acquisition module is used for acquiring the judicial texts and preprocessing them;
the word segmentation module is used for adding the legal vocabulary into the general word segmentation tool and segmenting the judicial text provided by the data acquisition module to obtain the judicial vocabulary text;
the primary label building module is used for obtaining the judicial vocabulary text provided by the word segmentation module, counting the word frequency and the combined word frequency, and extracting the vocabulary and the combined vocabulary of which the word frequency and the combined word frequency meet a set threshold value II to serve as a primary label system;
the expansion tag module is used for storing an expansion tag dictionary, counting the similarity of tags in the primary tag system, combining the tags meeting a set threshold value III, extracting corresponding expansion words from the expansion tag dictionary, adding the expansion words into the primary tag system and obtaining the expansion tag system;
the verification label module is used for storing a sample set and a search object set, wherein the sample set comprises a plurality of question labels and a set X of judicial vocabulary texts related to the questions, and the search object set comprises a set Y of judicial vocabulary texts; the labels of the set Y are obtained by using the expanded label system, the question labels are extracted from the sample set, and the accuracy with which the vocabulary texts retrieved from the set Y by the question labels match the vocabulary texts in the set X is counted;
the optimization label module judges whether the accuracy provided by the verification label module meets a set threshold V; if so, the current label system is the final label system; if not, the set threshold II in the primary label construction module, the set threshold III in the expansion label module, and the set threshold IV are adjusted.
By adopting at least one technical scheme, the following beneficial effects can be achieved:
Legal vocabularies from various sources are combined to construct a judicial vocabulary, which improves the word segmentation precision for legal texts; a high-precision segmentation result is the basis of subsequent text processing.
A primary label system is established using automatic keyword extraction and part-of-speech tagging.
Following a hierarchical approach, a separate label dictionary is established for each law and the dictionaries together form the label system, which effectively eliminates cross interference among laws.
Several semantic correlation methods are used to expand the label dictionaries and fill out the label system, which effectively eliminates the semantic ambiguity caused by non-standard expressions such as colloquial language.
A large number of cases are used as a test set, the label system is optimized by a subtractive verification method, and the validity of the label system is verified at the same time.
Drawings
Fig. 1 is a flowchart according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step are within the scope of the present application.
The first embodiment provides a method for constructing a judicial text label system, which specifically comprises the following steps:
1. Collecting and preprocessing judicial text data.
Judicial text data is collected, for example judgment documents containing fields such as case name, plaintiff and defendant information, cause of action, case details, applicable laws, and specific legal provisions. The laws, regulations, and explanatory provisions corresponding to the applicable laws and specific provisions cited in the judgment documents are also collected.
The judicial text data is then preprocessed: records whose case details field or applicable law field is empty are removed, records whose case details are shorter than a set case-detail length threshold are removed, and duplicate records are removed. For each broad legal category, such as marriage and family or traffic safety, enough cases need to be collected to ensure the diversity and comprehensiveness of the data.
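The three preprocessing filters can be sketched as below. The record layout is an assumption: the field names `case_details` and `applicable_law`, the duplicate key, and the length threshold of 20 characters are all hypothetical choices made for illustration.

```python
def preprocess(records, min_detail_len):
    """Drop records with empty fields, too-short case details, or duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        details = (rec.get("case_details") or "").strip()
        law = (rec.get("applicable_law") or "").strip()
        if not details or not law:
            continue                      # empty case details or applicable law
        if len(details) < min_detail_len:
            continue                      # case details below the length threshold
        key = (details, law)
        if key in seen:
            continue                      # repeated record
        seen.add(key)
        cleaned.append(rec)
    return cleaned

records = [
    {"case_details": "The parties separated for two years after repeated disputes.",
     "applicable_law": "Marriage Law"},
    {"case_details": "", "applicable_law": "Marriage Law"},
    {"case_details": "Too short", "applicable_law": "Road Traffic Safety Law"},
    {"case_details": "The parties separated for two years after repeated disputes.",
     "applicable_law": "Marriage Law"},
    {"case_details": "The defendant's vehicle struck the plaintiff at an intersection.",
     "applicable_law": None},
]
cleaned = preprocess(records, min_detail_len=20)
```

Only the first record survives: the others are empty, too short, a duplicate, or missing the applicable law.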
2. Acquiring a vocabulary text, where a vocabulary text is a text represented as a sequence of words.
The vocabulary text may be the result of word segmentation applied to a whole judgment document or to the text of a particular field in a judgment document. It may be acquired in one or more of the following ways.
A. The vocabulary text is acquired directly, either from another system or by direct input.
In one embodiment, the vocabulary text of a provision of the Marriage Law is: 'marriage', 'living together', 'two', 'implementation', 'home', 'family', 'violence', 'abuse', 'legacy', 'home', 'family', 'member', 'three', 'gambling', 'inhalation', 'bad habit', 'repeat', 'four', 'emotion', 'not', 'separate', 'full', 'two', 'year', 'five', 'result', 'couple', 'feeling', 'break', 'state', 'announcement', and 'missing'.
B. A judicial text is acquired and segmented with a word segmentation tool to obtain the vocabulary text. Existing word segmentation tools, such as jieba, THULAC from Tsinghua University, hanltp of Haohang, and funltp, provide essentially the same segmentation function. Each consists of a default vocabulary and a fast segmentation algorithm, so common words and general professional words are segmented successfully.
In one embodiment, a judicial text is obtained in which a provision of the Marriage Law reads: "(1) bigamy, or a spouse cohabiting with another person; (2) committing domestic violence, or maltreating or abandoning family members; (3) bad habits such as gambling or drug use that remain uncorrected despite repeated admonition; (4) separation for two full years due to emotional discord; and (5) other circumstances that lead to the breakdown of the couple's mutual affection. Where one party is declared missing and the other party files for divorce, the divorce shall be granted."
The judicial text is segmented with a word segmentation tool to obtain the vocabulary text. For the Marriage Law provision above, the vocabulary text is: 'marriage', 'living together', 'two', 'implementation', 'home', 'family', 'violence', 'abuse', 'legacy', 'home', 'family', 'member', 'three', 'gambling', 'inhalation', 'bad habit', 'repeat', 'four', 'emotion', 'not', 'separate', 'full', 'two', 'year', 'five', 'result', 'couple', 'feeling', 'break', 'state', 'announcement', and 'missing'.
Existing word segmentation tools cannot correctly segment highly specialized legal terms, such as 'person with limited capacity for civil conduct' or 'disease for which marriage is medically inadvisable'. To segment these terms correctly, a custom legal vocabulary is used.
C. A judicial vocabulary is constructed and added to the custom dictionary of the word segmentation tool, replacing the tool's default vocabulary, and the judicial text is segmented to obtain the vocabulary text. The judicial vocabulary is constructed as follows:
c.1 Entries from legal dictionaries, legal professional lexicons and the like are added to the vocabulary;
c.2 A combined word frequency statistical algorithm is used to combine conventional words into new words, and each new word whose combined word frequency exceeds a set threshold is added to the vocabulary, where the combined word frequency is the frequency with which two or more words appear together;
c.3 The vocabulary is added to the custom dictionary of the word segmentation tool, replacing the tool's default vocabulary, and the judicial text is segmented to obtain a vocabulary text; the vocabulary text is then reviewed manually by checking the segmentation results one by one and checking their word frequency statistics, and professional terms that were not segmented correctly are added to the vocabulary;
c.4 The reviewed vocabulary is used as the judicial vocabulary.
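Step c.2 can be illustrated with a small bigram counter. This is a sketch under assumptions: it only considers pairs of adjacent words, treats "exceeds a set threshold" as "count at least `threshold`", and joins Chinese tokens without a separator; the function name and the sample tokens (家庭 'family', 暴力 'violence', and so on) are hypothetical.

```python
from collections import Counter

def propose_new_words(segmented_texts, threshold):
    """Join adjacent word pairs that co-occur at least `threshold` times."""
    pair_counts = Counter()
    for tokens in segmented_texts:
        pair_counts.update(zip(tokens, tokens[1:]))   # adjacent pairs only
    return {a + b: n for (a, b), n in pair_counts.items() if n >= threshold}

# Toy output of a general-purpose segmenter that splits 家庭暴力 (family violence).
texts = [
    ["实施", "家庭", "暴力"],
    ["家庭", "暴力", "或", "虐待"],
    ["遗弃", "家庭", "成员"],
]
new_words = propose_new_words(texts, threshold=2)
```

The pair (家庭, 暴力) occurs twice, so 家庭暴力 is proposed as a new vocabulary entry; every other pair occurs once and is ignored. Proposed words would still go through the manual review of step c.3.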
In one embodiment, the judicial text is segmented using the judicial vocabulary to obtain a vocabulary text. For the Marriage Law provision, for example, the result is: 'bigamy', 'spouse living with another', 'two', 'implementation', 'family violence', 'abuse', 'abandoned family member', 'three', 'gambling', 'inhalation', 'bad habit', 'time course', 'four', 'emotional disorder', 'living apart', 'full', 'two', 'year', 'five', 'cause', 'couple', 'emotional rupture', 'situation', 'side', 'announcement', 'lost', 'side', 'propose', 'divorce', 'litigation', 'response', 'grant', 'leave', 'house', etc.
Compared with using the word segmentation tool directly, segmenting the judicial text with the judicial vocabulary correctly segments legal professional terms such as 'family violence' and 'emotional rupture'. Combining legal vocabularies from various sources into a judicial vocabulary improves the word segmentation precision for legal texts, and a high-precision segmentation result is the basis of subsequent text processing.
Furthermore, the part of speech of each word in the vocabulary text is checked; nouns, verbs and adjectives are retained, and all other words are removed.
3. Selecting candidate labels according to the word frequency and/or the combined word frequency of the vocabulary text to obtain a primary label system. The word frequency is the frequency or number of occurrences of a single word; the combined word frequency is the frequency or number of times two or more words occur together. One or more of the following may be used.
a) The word frequency of each individual word in the vocabulary text is counted, and a word is added to the primary label system as a candidate label when its word frequency is greater than a set threshold, until all words have been counted;
b) Each pair of adjacent words is taken as a combination, the combined word frequency in the vocabulary text is counted and sorted from high to low, and a set number of the top-ranked combinations are added to the primary label system as new words;
c) Using a window co-occurrence method, a window length K is defined, the number of occurrences of every combination of M words is counted by window traversal, the words in the N most frequent combinations are taken as keywords, the word frequency of each individual keyword is counted, and each word whose word frequency exceeds a set threshold is added to the primary label system as a candidate label.
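The window co-occurrence method in c) can be sketched as follows. Several details are assumptions made for illustration: windows slide one word at a time, each window contributes each unordered M-word combination once, and "exceeds a set threshold" is implemented as "at least `min_freq`"; the function name and sample tokens are hypothetical.

```python
from collections import Counter
from itertools import combinations

def candidate_labels(tokens, K, M, N, min_freq):
    """Select candidate labels by window co-occurrence counting."""
    combo_counts = Counter()
    for start in range(max(1, len(tokens) - K + 1)):
        window = sorted(set(tokens[start:start + K]))  # distinct words in window
        combo_counts.update(combinations(window, M))   # all M-word combinations
    # Words appearing in the N most frequent combinations become keywords.
    keywords = {w for combo, _ in combo_counts.most_common(N) for w in combo}
    # Keep keywords whose own word frequency reaches the threshold.
    freq = Counter(t for t in tokens if t in keywords)
    return {w for w, n in freq.items() if n >= min_freq}

tokens = ["divorce", "custody", "divorce", "property",
          "custody", "divorce", "weather"]
labels = candidate_labels(tokens, K=3, M=2, N=1, min_freq=2)
```

In this toy sequence the pair ('custody', 'divorce') co-occurs in all five windows, more than any other pair, and both words occur at least twice, so both become candidate labels.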
Further, rules are used to screen the labels (that is, the words) in the primary label system and to eliminate non-universal words and non-label words. Non-universal words are words in a preset non-universal vocabulary, such as personal names; non-label words are words in a preset non-label vocabulary, such as isolated verbs.
Because laws differ in what they regulate and legal language is highly specialized, the same object can play different roles under different laws. For example, a 'car' is property under the Marriage Law but represents the legal subject 'motor vehicle' under traffic law. Therefore, different laws use different label dictionaries, and the label dictionaries of multiple laws together form a label system.
A primary label system is established using automatic keyword extraction and part-of-speech tagging; following a hierarchical approach, a separate label dictionary is established for each law and the dictionaries together form the label system, which effectively eliminates cross interference among laws.
4. Merging and/or expanding the labels according to the similarity of the labels in the primary label system to obtain an expanded label system. The similarity of the labels may be calculated in one or more of the following ways.
In one embodiment, a character-based label similarity calculation is used. Let W1 and W2 denote two labels, $W1 = \{w_{11}, w_{12}, \ldots, w_{1e_1}\}$ and $W2 = \{w_{21}, w_{22}, \ldots, w_{2e_2}\}$, where $e_1$ and $e_2$ are the character lengths of labels W1 and W2; $w_{11}$, $w_{12}$ and $w_{1e_1}$ are the 1st, 2nd and $e_1$-th characters of label W1; and $w_{21}$, $w_{22}$ and $w_{2e_2}$ are the 1st, 2nd and $e_2$-th characters of label W2.
The similarity is $\mathrm{sim}(W1, W2) = \dfrac{\text{number of identical characters in } W1 \text{ and } W2}{\max(e_1, e_2)}$.
For example, if label 1 and label 2 are two different two-character words for 'couple' (such as 夫妻 and 夫妇) that share only the character for 'husband' (夫), the character lengths are 2 and 2, the number of identical characters is 1, and the similarity of the labels is 0.5.
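The character-based similarity can be sketched directly from its definition. One detail is an assumption: "number of identical characters" is interpreted here as the size of the multiset intersection of the two labels' characters. The function name and the sample labels (two hypothetical two-character labels sharing one character) are illustrative.

```python
from collections import Counter

def char_similarity(w1, w2):
    """Shared characters divided by the longer label's length."""
    if not w1 or not w2:
        return 0.0
    shared = sum((Counter(w1) & Counter(w2)).values())  # multiset intersection
    return shared / max(len(w1), len(w2))

# Two two-character labels that share exactly one character.
example = char_similarity("夫妻", "夫妇")
```

The two labels share one of two characters, so the similarity is 1 / 2.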
In one embodiment, a semantic-based label similarity calculation is used: a semantic model is built with a language model such as Word2Vec or GloVe; a large number of judicial texts of various types are collected as a corpus to train the semantic model; the two labels are input to the semantic model to obtain their correlation score(W1, W2); and the correlation of the two labels is taken as the similarity of the labels.
For example, given the two word pairs ('brother' ) and ('brother', 'motor vehicle'), the trained semantic model assigns the first pair a clearly higher correlation than the second.
In one embodiment, a label similarity calculation based on both characters and semantics is used: a character-based weight p and a semantic-based weight q are set, the character-based similarity sim(W1, W2) and the semantic-based similarity score(W1, W2) of labels W1 and W2 are obtained, and the overall similarity of the labels is calculated as p × sim(W1, W2) + q × score(W1, W2).
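The weighted combination can be sketched as below. Instead of a trained Word2Vec or GloVe model, the sketch uses a small hand-written table of word vectors and takes cosine similarity as the correlation score; the vectors, the weights p = q = 0.5, and all function names are assumptions for illustration only.

```python
import math
from collections import Counter

def char_similarity(w1, w2):
    """Character-based similarity: shared characters / longer label length."""
    shared = sum((Counter(w1) & Counter(w2)).values())
    return shared / max(len(w1), len(w2))

def cosine(u, v):
    """Cosine similarity, used here as a stand-in for the model's correlation."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_similarity(w1, w2, vectors, p=0.5, q=0.5):
    """Combined similarity p * sim(W1, W2) + q * score(W1, W2)."""
    return p * char_similarity(w1, w2) + q * cosine(vectors[w1], vectors[w2])

# Hypothetical two-dimensional embeddings standing in for a trained model.
vectors = {
    "夫妻": [1.0, 0.0],
    "夫妇": [1.0, 0.0],
    "机动车": [0.0, 1.0],
}
close = label_similarity("夫妻", "夫妇", vectors)
far = label_similarity("夫妻", "机动车", vectors)
```

With these toy vectors the first pair combines a character similarity of 0.5 with a correlation of 1.0, while the second pair shares neither characters nor direction in the vector space.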
The primary label system is a relatively simple vocabulary list, and some vocabularies in the list may have similar semanteme and need to be merged. In addition, the vocabulary in the table cannot be effectively compatible with the semantic diversity in the actual life, and needs to be expanded.
Merging and/or expanding tags to obtain an expanded tag system may be performed in one or more of the following ways.
In one embodiment, when the similarity of two labels exceeds threshold III, or when their similarity ranks among the top R of all label-pair similarity values in the primary label system, the two labels are merged: one of them is retained and the other is removed from the primary label system. When the similarity between a word in the semantic model or the synonym dictionary and a label word meets a set threshold IV, that word is taken as an expansion word of the label word and added to the primary label system.
For example, suppose the semantic model or the synonym dictionary contains the two words 'couple' and 'object', and the primary label system contains the label word 'couple'. The similarity between each word and the label word is calculated and checked against threshold IV; here the word 'couple' (a synonym of the label word) meets the condition and is taken as an expansion word of the label 'couple'.
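The merge and expansion rules can be sketched as follows. The function names, the rule of keeping the first label of a merged pair, and the toy similarity table are all assumptions made for illustration; the text only says that one label is retained. The top-R variant of the merge rule is not covered here.

```python
from itertools import combinations


def merge_labels(labels, similarity, threshold_iii):
    """Drop one label of every pair whose similarity exceeds threshold III."""
    kept = list(labels)
    for a, b in combinations(labels, 2):
        if a in kept and b in kept and similarity(a, b) > threshold_iii:
            kept.remove(b)  # assumption: retain the first label of the pair
    return kept


def expand_labels(labels, candidates, similarity, threshold_iv):
    """Map each label to the candidate words whose similarity meets threshold IV."""
    return {
        label: [w for w in candidates
                if w != label and similarity(w, label) >= threshold_iv]
        for label in labels
    }


def toy_sim(a, b):
    """Hypothetical similarity lookup standing in for the real calculation."""
    table = {
        frozenset(("couple", "spouses")): 0.9,
        frozenset(("couple", "partner")): 0.8,
    }
    return table.get(frozenset((a, b)), 0.1)


print(merge_labels(["couple", "spouses", "vehicle"], toy_sim, 0.85))
# ['couple', 'vehicle']
print(expand_labels(["couple"], ["partner", "object"], toy_sim, 0.7))
# {'couple': ['partner']}
```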
Label expansion yields, for example, the tables below. These tables are used to eliminate ambiguity: different expressions with the same meaning are unified into the same word, which completes text normalization.
Table 1 Example of a marriage-class label dictionary
[Table 1 appears as an image in the original publication.]
Table 2 Example of a traffic-class label dictionary
[Table 2 appears as an image in the original publication.]
In one embodiment, the expansion words corresponding to the words in the primary label system are extracted from the expansion label dictionary and added to the primary label system. When the similarity of two labels in the primary label system exceeds threshold III, or ranks among the top R of all label-pair similarity values in the primary label system, the two labels are merged: one is retained and the other is removed from the primary label system.
5. Determine that the final label system is constructed according to the accuracy with which the extended label system searches text.
The basic use of a text label system is text search. Comparing the search accuracy of different versions of a label system verifies its usefulness.
In one embodiment, a method for calculating accuracy of searching a text is provided.
5.1 Obtain judicial texts and extract the text of the fields related to case details and statutes; select candidate labels according to the word frequency and/or combined word frequency of the case vocabulary text and the statute vocabulary text to obtain a primary label system; and merge and/or expand labels according to the similarity of the labels in the primary label system to obtain an extended label system.
5.2 Create a test set that includes a sample set and a search object set. Each sample of the sample set comprises a question, the n cases most relevant to the question, and the m statutes most relevant to the question. The search object set includes all cases and all applicable statutes.
For example, the question of one sample is 'Accident while driving, taillight damaged by a non-motor vehicle, compensation?', with the 3 cases and the 6 statutes most relevant to the question.
5.3 Extract the text labels of the questions, cases, and statutes in the sample set to form label vectors.
5.4 Recommend the cases and applicable statutes in the search object set that are similar to the question by vector matching, where vector similarity is calculated using the Euclidean distance, that is, the norm of the difference between two vectors, which is the most commonly used vector distance.
5.5 Calculate the accuracy by comparing the recommended cases and statutes with the cases and statutes given in the sample set. The accuracy is represented by the average of the recall rate and the precision rate, where recall rate = number of correct samples retrieved / number of all correct samples in the data set, and precision rate (also called accuracy rate) = number of correct samples retrieved / total number of samples retrieved.
For example, if there are 5 recommendations of which 2 are correct, the recall rate is 40%; if the test set has 10 samples and the recommended results for 5 of them match the true values, the accuracy rate is 50%.
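Steps 5.3 to 5.5 can be sketched as below. The representation of a label vector as a binary indicator over a fixed label vocabulary is an assumption, as are all function names and the toy data; the text specifies only that label vectors are compared by Euclidean distance and that accuracy is the average of recall and precision.

```python
import math


def euclidean_distance(u, v):
    """Norm of the difference between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def recommend(query_vec, object_vecs, k):
    """Return the ids of the k search objects closest to the query vector."""
    ranked = sorted(object_vecs,
                    key=lambda oid: euclidean_distance(query_vec, object_vecs[oid]))
    return ranked[:k]


def recall_precision(recommended, relevant):
    """Recall and precision of a recommendation list against the true items."""
    hits = len(set(recommended) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(recommended) if recommended else 0.0
    return recall, precision


def accuracy(recommended, relevant):
    """Average of recall and precision, as described in step 5.5."""
    recall, precision = recall_precision(recommended, relevant)
    return (recall + precision) / 2


# Toy search object set: binary label vectors (assumed encoding).
objects = {"case1": [1, 0, 1], "case2": [0, 1, 0], "case3": [1, 1, 1]}
question = [1, 0, 1]

top2 = recommend(question, objects, k=2)
print(top2)                       # ['case1', 'case3']
print(accuracy(top2, ["case1"]))  # 0.75
```

In the toy run, one of the two recommendations is correct and the single relevant case is found, so recall is 1.0, precision is 0.5, and their average is 0.75.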
Table 3 Search object set example
Case 1 label | Statutes applicable to case 1 | Article 1 of XX Law | Article 1 label
Case 2 label | Statutes applicable to case 2 | Article 2 of XX Law | Article 2 label
Case N label | Statutes applicable to case N | Article N of other laws | Article N label
Table 4 Example of search results versus true values
[Table 4 appears as an image in the original publication.]
In one embodiment, the accuracy of the search text is calculated as follows.
Preset a sample set and a search object set. The sample set SS contains NC samples; sample $S_i$ includes a search question $Q_i$ and a set $X_i$ of vocabulary texts related to the search question; $X_i$ contains $H_i$ vocabulary texts, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iH_i}\}$. The search object set Y contains NS vocabulary texts, $Y = \{y_1, y_2, \ldots, y_{NS}\}$;
Obtain the extension labels Z of the search object set Y by using the extended label system, $Z = \{z_1, z_2, \ldots, z_{NS}\}$;
Extract each sample $S_i$ from the sample set in turn and obtain the label vector $T_i$ of its search question $Q_i$;
Compute the similarity between the label vector $T_i$ and each extension label $z_j$, and take the vocabulary texts corresponding to the $H_i$ extension labels with the highest similarity to form a comparison group T;
Calculate the single-search accuracy = (number of vocabulary texts that appear in both the comparison group T and the set $X_i$) / $H_i$;
Traverse the whole sample set and take the average accuracy as the accuracy of the search text.
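The procedure above amounts to a top-$H_i$ retrieval test averaged over the samples. A minimal sketch follows, assuming that labels are represented as sets and compared with Jaccard similarity; the text does not fix the similarity measure for this embodiment, so `jaccard` is only a placeholder and can be replaced by passing another function. All names and data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two label collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_search_accuracy(question_labels, relevant_ids, object_labels,
                           similarity=jaccard):
    """Share of the H truly relevant texts found among the top-H results."""
    h = len(relevant_ids)
    if h == 0:
        return 0.0
    ranked = sorted(object_labels,
                    key=lambda oid: similarity(question_labels, object_labels[oid]),
                    reverse=True)
    comparison_group = set(ranked[:h])
    return len(comparison_group & set(relevant_ids)) / h


def search_accuracy(samples, object_labels, similarity=jaccard):
    """Average single-search accuracy over all (question labels, relevant ids) samples."""
    scores = [single_search_accuracy(q, x, object_labels, similarity)
              for q, x in samples]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: extension labels Z of three search objects, and two samples.
object_labels = {
    "y1": {"divorce", "custody"},
    "y2": {"traffic", "injury"},
    "y3": {"divorce", "property"},
}
samples = [
    ({"divorce", "custody"}, ["y1", "y3"]),  # both relevant texts are retrieved
    ({"traffic"}, ["y1"]),                   # top result is y2, so a miss
]
print(search_accuracy(samples, object_labels))  # 0.5
```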
Further, when the accuracy of the searched text is greater than the threshold value V, the current expanded label system is the final label system, otherwise, the label system is optimized.
The label system can be optimized using one or a combination of the following methods:
1) Adjust the values of thresholds I, II, III, and IV and update the extended label system until the search accuracy of the current extended label system is greater than threshold V, thereby obtaining the final label system.
2) Adjust the legal vocabulary, the label similarity calculation method, and the search accuracy calculation method, and update the extended label system until the search accuracy of the current extended label system is greater than threshold V, thereby obtaining the final label system.
3) Taking the current extended label system as the object, calculate the search accuracy after removing a given label; if the accuracy is unchanged or higher than before the removal, remove that label from the extended label system; traverse all labels to obtain the final label system.
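Method 3) is a greedy backward elimination over the labels. The sketch below assumes that an accuracy function `evaluate` is supplied by the caller (for instance one of the search-accuracy calculations described earlier); the function name `prune_labels` and the toy evaluator are hypothetical.

```python
def prune_labels(labels, evaluate):
    """Greedily drop labels whose removal does not lower the search accuracy."""
    current = list(labels)
    best = evaluate(current)
    for label in list(current):
        trial = [item for item in current if item != label]
        score = evaluate(trial)
        if score >= best:  # unchanged or improved: the label is redundant
            current, best = trial, score
    return current


def toy_evaluate(labels):
    """Hypothetical accuracy: only labels 'a' and 'b' contribute to search quality."""
    return len({"a", "b"} & set(labels)) / 2


print(prune_labels(["a", "c", "b"], toy_evaluate))  # ['a', 'b']
```

Because each label is tested against the system as it stands after earlier removals, the result can depend on the traversal order; the text does not prescribe an order.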
The second embodiment provides a judicial text label system construction system, which comprises a legal vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, an extension label module, a verification label module and an optimization label module, wherein,
the legal vocabulary module stores a legal vocabulary which comprises professional vocabularies related to judicial law;
the data acquisition module is used for acquiring the judicial texts and preprocessing them;
the word segmentation module is used for adding the legal vocabulary into the general word segmentation tool and segmenting the judicial texts provided by the data acquisition module to obtain the judicial vocabulary texts;
the primary label building module is used for obtaining the judicial vocabulary text provided by the word segmentation module, counting the word frequency and the combined word frequency, and extracting the vocabulary and the combined vocabulary of which the word frequency and the combined word frequency meet a set threshold value II to serve as a primary label system;
the expansion tag module is used for storing an expansion tag dictionary, counting the similarity of tags in the primary tag system, combining the tags meeting a set threshold value III, extracting a corresponding expansion vocabulary from the expansion tag dictionary, adding the expansion vocabulary into the primary tag system and obtaining an expansion tag system;
the verification label module is used for storing a sample set and a search object set, wherein the sample set comprises a plurality of problem labels and a judicial vocabulary text set X related to the problems, the search object set comprises a plurality of judicial vocabulary text sets Y, the labels of the set Y are obtained by utilizing an extended label system, the problem labels are extracted from the sample set, and the accuracy of the vocabulary texts in the set Y and the vocabulary texts in the set X searched by utilizing the problem labels is counted;
the optimization label module judges whether the accuracy provided by the verification label module meets a set threshold V; if so, the current label system is the final label system; if not, the set threshold II in the primary label building module, the set threshold III in the extension label module, and the set threshold IV are adjusted.
Referring to fig. 1, a data processing flow of a judicial text label system construction system is as follows:
Data acquisition and preprocessing: collect about 160,000 civil judgment documents from the last 10 years, including marriage judgment documents and traffic judgment documents, and preprocess the data by removing judicial text data in which the case details field or the applicable law field is empty, removing judicial text data whose case details text is shorter than a set case-detail length threshold, removing duplicate judicial text data, and separately extracting the text of the specific fields of each judicial text, namely the case details and the applicable law. In addition, collect 170 commonly used civil laws and extract the text of two fields, the legal provisions and their specific stipulations.
Word segmentation: the word segmentation module adds the legal vocabulary to a general word segmentation tool and segments the preprocessed judicial texts to obtain the judicial vocabulary texts.
Primary label construction: extract the words whose word frequency meets a set threshold as the primary labels.
Label expansion: extract the corresponding expansion words from the expansion label dictionary.
Label verification: verify the labels through the correspondence between cases and statutes, comparing the search accuracy of different versions of the expanded labels.
Label optimization: judge whether the verified labels meet the requirement; if so, construction of the label system is complete; if not, feed back to the verification label module.
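The preprocessing and primary label steps of this flow can be sketched as follows. The record field names `case_details` and `applicable_law`, the default length threshold, and both function names are assumptions made for illustration; the sketch covers single-word frequency only and omits the combined word frequency and the word segmentation step.

```python
from collections import Counter


def preprocess(records, min_detail_len=50):
    """Drop records with empty fields, too-short case details, or duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        details = (rec.get("case_details") or "").strip()
        law = (rec.get("applicable_law") or "").strip()
        if not details or not law:
            continue  # an empty case-details or applicable-law field
        if len(details) < min_detail_len:
            continue  # case details shorter than the set threshold
        key = (details, law)
        if key in seen:
            continue  # duplicate judicial text
        seen.add(key)
        cleaned.append({"case_details": details, "applicable_law": law})
    return cleaned


def primary_labels(vocab_texts, threshold):
    """Words whose total frequency across all segmented texts meets the threshold."""
    counts = Counter(word for text in vocab_texts for word in text)
    return [word for word, count in counts.items() if count >= threshold]


records = [
    {"case_details": "long enough details", "applicable_law": "Article 1"},
    {"case_details": "long enough details", "applicable_law": "Article 1"},
    {"case_details": "", "applicable_law": "Article 2"},
    {"case_details": "short", "applicable_law": "Article 3"},
]
print(len(preprocess(records, min_detail_len=10)))  # 1
print(primary_labels([["divorce", "custody", "divorce"],
                      ["divorce", "property"]], threshold=2))  # ['divorce']
```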
While the present application has been described by way of examples, those of ordinary skill in the art will appreciate that there are numerous variations and permutations of the present application that do not depart from the spirit of the present application and that the appended embodiments are intended to include such variations and permutations without departing from the present application.

Claims (10)

1. A judicial text label system construction method is characterized by comprising the following steps:
acquiring a vocabulary text, wherein the vocabulary text refers to a form of representing a text by a vocabulary;
selecting candidate labels according to the word frequency and/or the combined word frequency of the vocabulary text to obtain a primary label system;
merging and/or expanding labels according to the similarity of the labels in the primary label system to obtain an expanded label system;
and determining that the final label system is constructed according to the accuracy of the text searched by the extended label system, wherein the accuracy of the text searched by the extended label system is calculated by the following steps: setting up a sample set and a search object set, wherein the sample set includes a plurality of problem labels and a judicial vocabulary text set X related to the problems, and the search object set includes a plurality of judicial vocabulary text sets Y; obtaining the labels of the set Y by using the extended label system; extracting the problem labels from the sample set; and counting the accuracy with which the vocabulary texts in the set Y retrieved by using the problem labels match the vocabulary texts in the set X.
2. The method of claim 1, wherein the obtaining of the lexical text comprises: constructing a judicial vocabulary, adding the judicial vocabulary into a custom dictionary of a word segmentation tool, and segmenting a judicial text to obtain a vocabulary text; wherein, the constructing the judicial vocabulary comprises:
adding the vocabulary of the legal dictionary and the vocabulary of the legal professional lexicon into a prepared vocabulary list;
counting the combined word frequency of the conventional words, and adding the conventional word combination with the combined word frequency meeting a set threshold I into the prepared vocabulary list as a new word;
rechecking, namely adding professional vocabulary that was not correctly segmented into the prepared vocabulary list;
and obtaining the judicial vocabulary.
3. The method for constructing a judicial text label system according to claim 1, wherein selecting candidate labels according to the vocabulary text word frequency and the combined word frequency to obtain a primary label system comprises:
defining the window length K, counting the occurrence times of any M vocabulary combinations by using a window traversal method, taking the vocabulary in the N combinations with the highest occurrence times as a keyword, counting the word frequency of a single vocabulary in the keyword, taking the vocabulary with the word frequency meeting a set threshold II as a candidate tag, and adding the candidate tag into the primary tag system.
4. The method for constructing a judicial text label system according to claim 1, wherein the similarity of labels is calculated by a method comprising:
setting a character-based label similarity weight p and a semantic-based label similarity weight q;
acquiring label similarity sim (W1, W2) of labels W1 and W2 based on characters, wherein sim (W1, W2) = the number of the same characters in label W1 and label W2/the larger value of the character length of label W1 and label W2;
obtaining a tag similarity score (W1, W2) of tags W1 and W2 based on semantics, wherein the score (W1, W2) is a correlation value of the tags W1 and W2, and the correlation value is obtained from a semantic model trained by using a judicial text as a corpus;
calculating the similarity of the labels = $p \cdot \mathrm{sim}(W1, W2) + q \cdot \mathrm{score}(W1, W2)$.
5. The method for constructing a judicial text label system according to claim 1, wherein:
the merging of labels specifically comprises: when the similarity of two labels meets a set threshold III, or ranks among the top R of the label similarity values of the primary label system, merging the two labels, retaining one of them, and removing the other from the primary label system;
the expanding of labels specifically comprises: when the similarity between a word in a semantic model or a synonym dictionary and a label word meets a set threshold IV, taking that word as an expansion word of the label word and adding the expansion word to the primary label system.
6. The method for constructing a judicial text label system according to claim 1, wherein: the accuracy of the search text is calculated by the following method:
establishing a test set, wherein the test set comprises a sample set and a search object set, each sample of the sample set comprises a question, n cases most relevant to the question and m legal rules most relevant to the question, and the search object set comprises all case and legal rule sets;
extracting text labels of problems, cases and statutes in a sample set to form label vectors;
recommending the cases and applicable statutes in the search object set that are similar to the question by using a vector matching method, wherein the vector similarity is calculated by using the Euclidean distance;
calculating the accuracy by comparing the recommended cases and statutes with the cases and statutes given in the sample set, wherein the accuracy is represented by the average of the recall rate and the precision rate, the recall rate = the number of correct samples retrieved / the number of all correct samples in the data set, and the precision rate (also called accuracy rate) = the number of correct samples retrieved / the total number of samples retrieved.
7. The method for constructing a judicial text label architecture according to claim 1, wherein the accuracy of the search text is calculated by:
presetting a sample set and a search object set, wherein the sample set SS comprises NC samples, a sample $S_i$ includes a search question $Q_i$ and a vocabulary text set $X_i$ related to the search question, and the vocabulary text set $X_i$ comprises $H_i$ vocabulary texts, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iH_i}\}$; the search object set Y comprises NS vocabulary texts, $Y = \{y_1, y_2, \ldots, y_{NS}\}$;
obtaining the extension labels Z of the search object set Y by using the extended label system, $Z = \{z_1, z_2, \ldots, z_{NS}\}$;
sequentially extracting a sample $S_i$ from the sample set and obtaining the label vector $T_i$ of the search question $Q_i$;
computing the similarity between the label vector $T_i$ and each extension label $z_j$, and taking the vocabulary texts corresponding to the $H_i$ extension labels with the highest similarity to form a comparison group T;
calculating the single-search accuracy = (the number of vocabulary texts that appear in both the comparison group T and the set $X_i$) / $H_i$;
and traversing the whole sample set, and calculating the average accuracy as the accuracy of the search text.
8. The method for constructing a judicial text label system according to claim 1, wherein determining that the final label system is constructed according to the accuracy of the text searched by the extended label system comprises: when the accuracy of the searched text meets a set threshold V, the current extended label system is the final label system; otherwise, the values of thresholds I, II, III, and IV are adjusted and the current extended label system is updated until the accuracy of the text searched by the updated extended label system meets the set threshold V, thereby obtaining the final label system, wherein threshold I is the combined word frequency that a conventional word combination must meet to be added to the prepared vocabulary list as a new word, threshold II is the word frequency that a word must meet to be taken as a candidate label, threshold III is the similarity that labels must meet to be merged, and threshold IV is the similarity that a word must have with a label to be taken as an expansion word of that label.
9. The method of claim 1, wherein determining that the final tag architecture is complete according to the accuracy of the extended tag architecture search text comprises: and when the accuracy of the searched text meets a set threshold value V, the current expansion label system is the final label system, otherwise, the accuracy of the searched text after the removal of a certain label is calculated, if the accuracy is unchanged or increased compared with the accuracy obtained before the removal of the label, the label is removed from the expansion label system, all labels are traversed, and the final label system is obtained.
10. A judicial text label system construction system comprises a legal vocabulary module, a data acquisition module, a word segmentation module, a primary label construction module, an expansion label module, a verification label module and an optimization label module, wherein,
the legal vocabulary module stores a legal vocabulary which comprises professional vocabularies related to judicial law;
the data acquisition module acquires a judicial text and performs preprocessing;
the word segmentation module adds the legal vocabulary into a general word segmentation tool, and segments the judicial texts provided by the data acquisition module to obtain the judicial vocabulary texts;
the primary label building module is used for obtaining the judicial vocabulary text provided by the word segmentation module, counting word frequency and combined word frequency, and extracting vocabularies and combined vocabularies of which the word frequency and the combined word frequency meet a set threshold value II to serve as a primary label system;
the extended tag module is used for storing an extended tag dictionary, counting the similarity of tags in the primary tag system, combining the tags meeting a set threshold value III, extracting a corresponding extended vocabulary from the extended tag dictionary, and adding the extended vocabulary into the primary tag system to obtain an extended tag system;
the verification label module is used for storing a sample set and a search object set, wherein the sample set comprises a plurality of problem labels and a judicial vocabulary text set X related to problems, the search object set comprises a plurality of judicial vocabulary text sets Y, the labels of the set Y are obtained by using the extended label system, the problem labels are extracted from the sample set, and the accuracy of the vocabulary texts in the set Y and the vocabulary texts in the set X searched by using the problem labels is counted;
the optimized tag module judges whether the accuracy provided by the verification tag module meets a set threshold value V, and if the accuracy meets the set threshold value V, the current tag system is a final tag system; if not, adjusting a set threshold II in the primary label building module, a set threshold III in the extension label module and a set threshold IV.
CN201811294777.8A 2018-11-01 2018-11-01 Method and system for constructing judicial text label system Active CN109543178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811294777.8A CN109543178B (en) 2018-11-01 2018-11-01 Method and system for constructing judicial text label system

Publications (2)

Publication Number Publication Date
CN109543178A CN109543178A (en) 2019-03-29
CN109543178B true CN109543178B (en) 2023-02-28

Family

ID=65846358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811294777.8A Active CN109543178B (en) 2018-11-01 2018-11-01 Method and system for constructing judicial text label system

Country Status (1)

Country Link
CN (1) CN109543178B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084290B (en) * 2019-06-13 2024-04-05 北京沃东天骏信息技术有限公司 Data retrieval method, device, equipment and storage medium
CN110675241A (en) * 2019-08-15 2020-01-10 上海新颜人工智能科技有限公司 Label calibration system and method
CN110929513A (en) * 2019-10-31 2020-03-27 北京三快在线科技有限公司 Text-based label system construction method and device
CN110928981A (en) * 2019-11-18 2020-03-27 佰聆数据股份有限公司 Method, system and storage medium for establishing and perfecting iteration of text label system
CN111177388B (en) * 2019-12-30 2023-07-21 联想(北京)有限公司 Processing method and computer equipment
CN113065312A (en) * 2020-01-02 2021-07-02 北京沃东天骏信息技术有限公司 Text label extraction method and device
CN111353045B (en) * 2020-03-18 2023-12-22 智者四海(北京)技术有限公司 Method for constructing text classification system
CN111221974B (en) * 2020-04-22 2020-08-14 成都索贝数码科技股份有限公司 Method for constructing news text classification model based on hierarchical structure multi-label system
CN111524043A (en) * 2020-04-24 2020-08-11 南京擎盾信息科技有限公司 Method and device for automatically generating litigation risk assessment questionnaire
CN111666771B (en) * 2020-06-05 2024-03-08 北京百度网讯科技有限公司 Semantic tag extraction device, electronic equipment and readable storage medium for document
CN112148868A (en) * 2020-09-27 2020-12-29 南京大学 Law recommendation method based on law co-occurrence
CN112365372B (en) * 2020-10-09 2024-01-12 银江技术股份有限公司 Quality detection and evaluation method and system for referee document
CN112925902B (en) * 2021-02-22 2024-01-30 新智认知数据服务有限公司 Method, system and electronic equipment for intelligently extracting text abstract from case text
CN113505192A (en) * 2021-05-25 2021-10-15 平安银行股份有限公司 Data tag library construction method and device, electronic equipment and computer storage medium
CN113948087B (en) * 2021-09-13 2023-01-17 北京数美时代科技有限公司 Voice tag determination method, system, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004318381A (en) * 2003-04-15 2004-11-11 National Institute Of Advanced Industrial & Technology Similarity computing method, similarity computing program, and computer-readable storage medium storing it
JP2017078919A (en) * 2015-10-19 2017-04-27 日本電信電話株式会社 Word expansion device, classification device, machine learning device, method, and program
CN106682149A (en) * 2016-12-22 2017-05-17 湖南科技学院 Label automatic generation method based on meta-search engine
CN107577785A (en) * 2017-09-15 2018-01-12 南京大学 A kind of level multi-tag sorting technique suitable for law identification

Also Published As

Publication number Publication date
CN109543178A (en) 2019-03-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Applicant after: Yinjiang Technology Co.,Ltd.

Address before: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Applicant before: ENJOYOR Co.,Ltd.

GR01 Patent grant