CN107577785B

CN107577785B - Hierarchical multi-label classification method suitable for legal identification

Info

Publication number: CN107577785B
Application number: CN201710832304.8A
Authority: CN
Inventors: 柏文阳; 陈朋薇; 张剡; 周嵩
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2017-09-15
Filing date: 2017-09-15
Publication date: 2020-02-07
Anticipated expiration: 2037-09-15
Also published as: CN107577785A

Abstract

The invention discloses a hierarchical multi-label classification method suitable for legal identification, which comprises the following steps: step 1, extracting case facts and legal provisions thereof from a preprocessed referee document; step 2, expanding legal provisions corresponding to case facts based on the hierarchical structure of the label space, and enabling category labels of case samples to be a subset of the label space; step 3, performing word segmentation and part-of-speech tagging on case fact texts, performing feature selection on segmentation results, and selecting feature words capable of sufficiently representing case facts to construct feature vectors; step 4, constructing a prediction model: finding out a k neighbor sample set N (x) of the unseen example x in the extended multi-label training set, setting weight for each neighbor sample, calculating confidence coefficient of the unseen example belonging to each class according to the classification weight of the k neighbor samples to each class, and finally predicting a class label set of the unseen example.

Description

Hierarchical multi-label classification method suitable for legal identification

Technical Field

The invention belongs to the field of computer data analysis and mining, and relates to a hierarchical multi-label classification method suitable for legal identification.

Background

Hierarchical multi-label classification is a special case of multi-label classification. Unlike general multi-label classification, in a hierarchical multi-label classification problem, each sample can have multiple class labels, while the sample label space is organized in a tree or directed acyclic graph hierarchy. In the directed acyclic graph, one node may have a plurality of father nodes, which is more complex than a tree structure and has greater difficulty in designing an algorithm, so that the current research on hierarchical multi-label classification mainly aims at the category label structure of the tree. According to different modes of observing category hierarchical structures by algorithms, the hierarchical multi-label classification algorithm can be divided into a local algorithm and a global algorithm.

And (3) the local algorithm inspects the local classification information of each internal node in the category hierarchy one by one, and converts the hierarchical multi-label classification problem into a plurality of multi-label classification problems. And when training the multi-label classifier on the internal node, an appropriate local sample set needs to be selected. And in the prediction stage, a top-down equal prediction mode is adopted to enable the prediction result to meet the hierarchical requirement. Document ESULI A, FAGNI T, SEBASTIANI F.TreeBoost.MH A boosting algorithm for multi-label text classification [ C ]// String Processing and formatting retrieval 2006: 13-24. TreeBoost.MH algorithm is proposed to handle the hierarchical multi-label text classification problem. The algorithm recursively trains multi-label classifiers on each non-leaf node in the class label tree, the base classifier selects adaboost. The experimental effect proves that the TreeBoost.MH algorithm is better than the AdaBoost.MH algorithm in time efficiency and prediction performance. The documents CERRI R, BARROS R C, DE CARVALHO AC. hierarchical multi-layer-label using local neural networks [ J ]. Journal of Computer and System sciences,2014,80(1): 39-56, a local hierarchical multi-label classification algorithm based on a multi-layer perceptron is proposed, a multi-layer perceptron network is trained at each layer of the category hierarchy, each neural network is associated with a category hierarchy for predicting the category label at the hierarchy, and the prediction result of the neural network at a certain layer is used as the input of the neural network at the next layer. Because each layer of neural network is trained on the same sample set, the prediction result can not meet the hierarchical constraint, and the prediction result needs to be subjected to subsequent processing to ensure that the prediction result meets the hierarchical constraint.

The local algorithm has the disadvantages that on one hand, a plurality of classifiers are required to be trained, so that the model is relatively complex, and the understandability of the model is influenced; on the other hand, a blocking problem occurs in the prediction process, that is, the samples which are misclassified at the upper layer cannot reach the classifier at the lower layer, and although three strategies of reducing the threshold, limiting voting and expanding threshold multiplication are proposed to solve the blocking problem of the local algorithm, the local algorithm is not ideal in prediction accuracy.

The global algorithm considers the hierarchical structure of the category as a whole, trains a single-level multi-label classifier and predicts unseen examples. Global algorithms can be mainly classified into the following according to the way they process class label hierarchies: one global algorithm is to use class clustering to first compute the similarity of the test samples to each class and then classify the test samples into the closest class. The other method is to convert the hierarchical multi-label classification problem into a multi-label classification problem for processing: documents KIRITCHENKO S, MATWIN S, famill f.functional organization of genetic temporal organization [ J ],2005. extend the class labels of training samples, add their ancestor class labels, convert the hierarchical multi-label classification problem into a multi-label classification problem for processing. In the testing stage, the adopted multi-label classification algorithm AdaBoost.MH does not consider the hierarchical structure of the categories, so the same problem as that of a local algorithm is faced, namely, the predicted result has the situation of inconsistent hierarchies, and the output of the model also needs to be corrected to ensure that the hierarchy limit is met. There is also a global algorithm that adapts existing non-hierarchical classification algorithms to directly process hierarchical information and use the hierarchical information to improve performance. The literature VENS C, STRUFF J, SCHIETGAT L, et al, precision treeesfor hierarchical multilabelellar classification [ J ] Machine Learning,2008,73(2): 185-. Experimental results show that the global Clus-HMC algorithm is better than the Clus-SC and Clus-HSC algorithms in prediction performance and is better in time efficiency.

In general, global algorithms have two features: considering the hierarchical structure of the categories as a whole once; there is no modularity specific to the local algorithm. The key difference between the global algorithm and the local algorithm lies in the training process, and in the testing stage, the global algorithm can even use a top-down mode like the local algorithm to predict the category of the unseen instances.

Since the organization of class labels in the hierarchical multi-label classification problem is hierarchical, if a sample has a class label c_iThen the sample implicitly has c_iAll ancestor category labels of; on the other hand, in predicting the category of the unseen instance, the hierarchical constraint is also satisfied, i.e., it cannot happen that the unseen instance belongs to a category and not to an ancestor category of the category. A general hierarchical multi-label classification algorithm often cannot ensure that a prediction result meets the hierarchical limitation, or cannot obtain the optimal learning effect because the hierarchical structure features of a label space are not utilized. Therefore, the hierarchical multi-label classification algorithm not only needs to make full use of the association and the hierarchical structure between the class labels to improve the prediction performance of the classification model, but also needs to enable the prediction result to meet the hierarchical limitation.

The problem of automatic identification of case applicable laws is essentially a hierarchical multi-label classification problem, the category labels of the samples, namely, the legal provisions applicable to the cases, are organized in a tree structure, one case may be applicable to a plurality of legal provisions, and the specific degrees of the legal provisions applicable to the case may be different. The corresponding hierarchical multi-label classification algorithm for solving the problem of automatic identification of case applicable law needs to be capable of processing tree-shaped class hierarchical structures, and is a non-mandatory leaf node prediction algorithm, and predicted class labels can correspond to any nodes in the class hierarchical structures.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to solve the technical problem of providing an effective hierarchical multi-label classification method suitable for legal identification aiming at the defects of the prior art.

The technical scheme is as follows: the invention discloses a hierarchical multi-label classification method suitable for legal identification, which comprises the following steps:

step 1, crawling a required referee document original text data set from the Internet by using a crawler technology based on a jsup, wherein one referee document corresponds to one sample, and randomly dividing the referee document into a training sample set and a testing sample set according to a ratio of 7: 3. Then, preprocessing the official document: extracting case facts and applicable legal provisions thereof from the case facts according to a literary structure of a referee document, wherein the case facts are used for generating feature vectors of case samples, the case samples comprise training samples and test samples, the applicable legal provisions are used for representing class labels of the case samples, an original text data set is converted into a semi-structured multi-label training sample set and a semi-structured test sample set, and the semi-structured sample form is as follows: (case fact description, legal provisions text); correcting errors and format inconsistency in case-applicable legal provisions; the language processing method is characterized in that a language technology platform LTP of Hadamard is used as a language processing tool (the LTP is a whole set of Chinese language processing system, a language processing result representation based on XML is formulated, and a whole set of abundant and efficient Chinese language processing modules (including six Chinese processing core technologies such as lexical, syntactic and semantic) from bottom to top, an application program interface and a visualization tool based on a Dynamic Link Library (DLL) and capable of being used in a network service mode) are provided on the basis of the language processing result representation based on XML to perform word segmentation and part of speech tagging on case fact description.

And 2, because the organization of the legal provision in the legal system is in a tree structure, correspondingly, the label space formed by the category labels in the multi-label training set is in a tree structure. Based on a hierarchical structure of a label space formed by category labels in a multi-label training sample set, legal provisions corresponding to case facts of all case samples are expanded, so that the category label corresponding to each case fact is a subset of the label space and meets the hierarchical limitation;

step 3, performing feature selection on the word segmentation result from the training set in the step 1 (the word segmentation result refers to the case fact part of the semi-structured multi-label training set in the step 1), and selecting feature words capable of fully representing the case fact to construct feature vectors; obtaining a structured extended multi-label training sample set Tr and an extended multi-label testing sample set Te through text representation;

step 4, constructing a prediction model: finding k neighbor sample sets N (x) of unseen examples x from the extended multi-label test sample set Te in the extended multi-label training sample set Tr, setting weight for each neighbor sample, calculating confidence coefficients of the unseen examples x belonging to each category in the label space according to the classification weight of the k neighbor samples to each category in the label space, predicting category label sets h (x) of the unseen examples x, and h (x) meeting the hierarchical constraint. And finally, removing the hierarchical restriction in the category label set h (x) of the predicted unseen example x according to the tree structure of the label space (namely, the reverse process of label expansion) to obtain the specific applicable legal provisions of the unseen example. .

The step 2 comprises the following steps:

step 2-1, in the hierarchical multi-label classification problem, a d-dimensional instance space is given

(Is a real number set) and a label space Y containing q classes { Y ═ Y₁,y₂,…,_q}，y_vThe label of the v-th class is represented, v is more than or equal to 1 and less than or equal to q, the spatial hierarchy of the class label can be represented by a binary group (Y and less), the < represents the partial order relation of the class label, if Y exists_v,y_uE is Y and Y_v＜y_uThen class label y_vBelongs to category label y_u，y_vIs y_uDescendant class tag of, y_uIs y_vThe ancestor class label, < represents the partial order relationship of the class label, the partial order relationship < can be understood as "belonging to" the relationship, i.e., if there is y_v,y_uE is Y and Y_v＜y_uThen class label y_vBelongs to category label y_u，y_vIs y_uDescendant class tag of, y_uIs y_vThe partial order relationship < has asymmetry, non-reflexivity and transitivity, and can be described by the following four characteristics:

a) the only root node in the class label hierarchy is represented by a virtual class label R for any y_iE is Y, has Y_i＜R；

b) For any y_i,y_jE.g. Y, if there is Y_i＜y_jThen, then

c) Arbitrary y_iIs e.g. Y, has

d) Arbitrary y_i,y_j,y_k∈Y，y_i＜y_jAnd y is_j＜y_kThen there is y_i＜y_k。

The multi-label classification problem in which the organizational structure of the category labels satisfies the above four features can be regarded as a hierarchical multi-label classification problem. As can be seen from the above formal definitions, in the hierarchical class label space, all other class nodes (excluding the start node) on the unique path formed by tracing back from any class node to the root node are ancestor class nodes of the class node. Thus if the sample has a class label y_iThen the sample implicitly also has y_iAll ancestor class labels of, this requires that the classifier must satisfy a hierarchical constraint on the set of class labels h (x) that predict unseen instance x, i.e.,

and y '< y': y ', (x) where y' is the class label in h (x) and y 'is an ancestor class label for y';

step 2-2, for each multi-labeled case sample (x)_i,h_i) I is more than or equal to 1 and less than or equal to m, m is the number of all the obtained referee document samples, x_ie.X is a feature vector with d dimension for representing the fact part of the case,

is equal to x_iA corresponding set of class labels, i.e. x_iCorresponding legal provision, the expanded category label set is h_i′，Then h is_iIn' contains h_iAll category labels in (1) and all ancestor category labels thereof. In a formalized way, the method comprises the following steps of,

wherein y' is h_iThe ancestor class label of class label y in (1), y ∈ h_i；

The label extension process explicitly expresses the hierarchical relationship of the category labels in the category labels of the sample: if a sample is marked as a certain category, then the ancestor categories of the categories are also explicitly assigned to the sample through label expansion; the category label of each sample can be viewed as a subtree of the label space tree, and the top level of each subtree is the root node. It can be seen that if there is y_i,y_jE is Y and Y_i＜y_jFor example, in the k neighbor sample in the extended multi-label training set, there is a class label y_iMust not be less than having a class label y_jThe number of samples of (1). The label expansion is an important step for ensuring that the prediction result of the learning algorithm meets the level limit.

The step 3 comprises the following steps:

and 3-1, the purpose of feature selection is to reduce dimensions of features, and since a common text feature selection algorithm cannot directly process a multi-label data set, multi-label sample data needs to be converted into single-label sample data for processing. The conversion method comprises the following steps: for each multi-labeled case sample (x)_i,h_i) (i is more than or equal to 1 and less than or equal to m) using | h_iI represents a category label set h of a multi-labeled case sample_iThe number of label categories in the label list is replaced by | h_iL single-label case sample (x)_i′,y_i′)(1≤i′≤|h_i|,y_i′∈h_i) Class label y for each single label sample_i′I.e. the category label set h_iA category label of; the multi-label case samples comprise multi-label training samples and multi-label testing samples; table 1 gives an example of converting a multi-label exemplar to a single label exemplar according to the above strategy.

TABLE 1 Multi-tag sample conversion Process

And 3-2, converting the multi-label case sample into a case sample with a plurality of single labels through the conversion process of the step 3-1, performing feature selection on the word segmentation result obtained from the original training set in the step 1 by using a general feature selection algorithm, selecting a certain number of feature words with distinguishing capability (usually, the total information gain of the selected feature words is as large as possible and the number of the feature words is not too much, for example, when the feature selection is performed by using an information gain algorithm, at least 100 feature words are generally selected) to form a feature space, and using the feature words from the feature space to represent the case fact part of each case sample. Wherein, the attribute value corresponding to each feature word, namely the feature weight, is calculated by adopting a commonly used TF-IDF algorithm. And (3) regarding the case fact part of each case sample with the single label as a document with words segmented, and then combining the case fact parts of all case samples with the single label into a document set. Feature weight tf-idf of j-th dimension feature in ith document in document set_i″j″The definition is as follows:

wherein tf-idf_i″j″Representation feature word t_j″In document d_i″Frequency of occurrence of idf_j″Representation feature word t_j″Inverse document frequency in a document collection, N represents the total number of documents in the document collection, N_j″Representation feature word t_j″Frequency of documents in a document set, i.e. occurrence of a characteristic word t in a document set_j″The denominator of the number of documents is a normalization factor;

and 3-3, performing feature selection on the word segmentation result in the step 1 by using an information gain algorithm or a chi-square statistical algorithm, and selecting about 100 feature words with the highest distinguishing capability to form a feature vector. Commonly used text feature selection methods are mainly based on Document Frequency (DF), Mutual Information (MI), Information Gain (IG), chi-square statistics (χ)²Statistical, CHI) and the like. The feature selection based on the document frequency is too simple, the feature words with the most classified information cannot be selected, and the mutual information has the defect that the mutual information is easily influenced by the marginal probability of the feature words, so that the hierarchical multi-label classification method selects information gain or chi-square statistical algorithm to select the features.

Step 3-3 comprises: and (3) selecting features by adopting an information gain algorithm: the information gain ig (t) of the feature word t is defined as follows:

wherein, p (y)_v) Presentation category label y_vProbability of occurrence, p (t) represents the probability of occurrence of the feature word t,indicating a category label y on the premise of the occurrence of a characteristic word t_vThe probability of occurrence of the event is,

indicates the probability that the feature word t does not occur,

indicating a category label y on the premise that the feature word t does not appear_vThe probability of occurrence, for each feature word in the document set, calculating the information gain, the information gain value is lower than the set threshold (for example, 0.15, the threshold is set so that the total information gain of the selected feature words is as large as possible and the number of feature words is not largeToo many) feature words are not included in the feature space;

step 3-3 can also adopt chi-square statistical algorithm to carry out feature selection: it is assumed that the feature words are not related to class, and if the test value calculated using the CHI distribution deviates more from the threshold, then there is more confidence in negating the original hypothesis, accepting an alternative hypothesis to the original hypothesis: i.e. the characteristic words have a high degree of correlation with the categories.

Let A be a label y containing a feature word t and belonging to a category _v1 < v < q, B is a label y containing a feature word t but not belonging to a category_vC is a label y belonging to a category without including a feature word t_vD is a label y which does not contain the feature word t and does not belong to the category_vN is the total number of documents in the document set, then the characteristic word t and the category label y_vChi square statistic of²(t,y_v) Is defined as:

characteristic word t and category y_vWhen independent, the chi-square statistic is 0, the chi-square statistic about each category is calculated for one feature word, and then the mean χ is calculated respectively² _avg(t) and maximum value

By Chi² _avg(t) and

taking into account, a certain number (about 100) of feature words with differentiating capability are selected, wherein p (y)_v) Presentation category label y_vProbability of occurrence:

the main advantage of the chi-squared statistical feature selection algorithm over mutual information is that it is a normalized value, and therefore can better scale different feature words in the same category.

In step 4, when k neighbors are found, no example x and no neighbor sample in the extended multi-label training sample set are found(x_a,h_a) Distance d (x, x)_a) Wherein (x)_a,h_a)∈N′(x)，1≤a≤k，，h_aIs x_aAnd calculating the reciprocal of cosine similarity of the feature vectors of the corresponding class labels, wherein the cosine similarity cos (gamma, lambda) of the feature vector gamma of the example x and the feature vector lambda of the adjacent sample is calculated by the following formula:

where S denotes the index of the component of the feature vector, i.e. the position of the component in the feature vector, S denotes the dimension of the feature vector, γ_sDenotes the s-th component, λ, of the feature vector γ_sRepresenting the s-th component of the feature vector lambda.

In step 4, d (x, x) is used_a) Indicating that example x is not found and the neighboring samples (x) in the extended multi-labeled training sample set_a,h_a) The distance of the extended multi-label training sample set is calculated by adopting a full label distance weight method or an entropy label distance weight method to h_aClass label y in (1)_jBy classification weight w_aj，1≤j≤q；

Calculation of w by full-label distance weight method_aj：

Calculation of w by entropy label distance weight method_aj：

Unseen instance x belongs to the class label y_jConfidence of (c) (x, y)_j) The calculation formula is as follows:

wherein w_arRepresents h_aR ofIdentification label y_rThe classification weight of (2);

the class label set h (x) for the prediction of missing instance x is:

and selecting 0.5 as a decision threshold, and when the confidence of the unseen example x belonging to each class label is smaller than the decision threshold, returning the class label with the maximum confidence as the class label of the unseen example.

As a hierarchical multi-label classification method, the prediction result thereof needs to satisfy the hierarchical constraint, that is,

and y '< y' ∈ h (x). The following is a demonstration: from the confidence calculation formula, if the algorithm predicts that no instance x has a class label y_a(y_aE Y), then x belongs to category Y_aConfidence of (c) (x, y)_a) Greater than the threshold t or maximum in all categories. Investigation class y_aAncestor class y of_b(y_b∈Y,y_a＜y_b) If y is_bCorresponding to the virtual root node in the category hierarchy, then x has a category label y_aClearly meeting the hierarchical constraint; otherwise, for any neighbor sample of x (x)_i,y_i) e.N (x), if y_a∈Y_iThen y is also present_b∈Y_iAnd otherwise, the result is not necessarily true, and the label extension process of the training set ensures that the conclusion is true. Therefore, with the full tag distance weight method and the entropy tag distance weight method, it can be derived:

on the denominator

max_1≤r≤qw_irRemain unchanged, so x belongs to class y_bConfidence of (c) (x, y)_b) X belongs to the category y_aConfidence of (c) (x, y)_a) If there is c (x, y)_a)>t, must also have c (x, y)_b)>t, so the prediction results satisfy the hierarchical constraint.

Finally, the performance evaluation index of the learning method adopts a hierarchical evaluation index: the hierarchical precision (hP), the hierarchical recall (hR) and the hierarchical F metric (hF) are defined as follows:

wherein the content of the first and second substances,

is the set of classes to which the predicted test sample i belongs and its ancestor classes,

is the set of classes to which the test sample i actually belongs and its ancestor classes, and the summation operation is to compute the values over all test samples.

In order to make the identification of case applicable laws more practical, the target category predicted by the algorithm is preferably a specific legal provision, not just a broad law, so the method considers the prediction performance of the target category in both cases of the whole legal provision and the specific legal provision. Hereinafter, the hierarchical precision, recall rate and F metric value of the system when the target category is all legal provisions are denoted by hP _ all, hR _ all and hF _ all, respectively, and the hierarchical precision, recall rate and F metric value of the algorithm when the target category is a specific legal provision are denoted by hP _ partial, hR _ partial and hF _ partial.

Besides the hierarchical evaluation index, the precision, the recall rate and the F metric value of each category can be calculated respectively, and the average value of the precision, the recall rate and the F metric value of all the categories is used as the evaluation index of the system performance, namely Macro-averaging (Macro-averaging) of the precision, the recall rate and the F metric value. For each category, let TP denote the number of true positive examples, FP denote the number of false positive examples, TN denote the number of true negative examples, and FN denote the number of false negative examples, the formula for computing the Macro-average Macro-P, Macro-R, Macro-F of accuracy, recall rate, and F value is as follows:

the invention relates to a global hierarchical multi-label classification method, which considers the hierarchical structure of class labels on the whole and ensures that the prediction result also meets the hierarchical limitation. The learning method is an inertia learning algorithm, a clear prediction model is not required to be constructed on a training set, and only the original multi-label sample is subjected to label expansion and then stored, so that incremental learning is supported; in the prediction stage, k adjacent samples of the unseen examples in the training set are firstly found, the confidence coefficient of the examples belonging to each class is determined according to the classification weight of the adjacent samples to each class, and then the class of the unseen examples is predicted. The learning method is simple in model, supports incremental learning, and can be well applied to automatic identification of the problem of multi-level multi-label classification which contains massive data and continuously increases data in case-applicable law.

Has the advantages that: the hierarchical multi-label classification method suitable for legal recognition provided by the invention fully considers the tree-shaped hierarchical structure of the legal provision label space on the whole, so that the prediction result meets the hierarchical limitation, and the prediction result does not need to be additionally corrected. Meanwhile, the method is simple in model, supports incremental learning, and can be well applied to automatic identification of the problem of multi-level multi-label classification which contains massive data and continuously increases data in case-applicable law.

Drawings

The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.

FIG. 1 is a main flow chart of the present invention.

FIG. 2 is a sample official document.

FIG. 3 legal provisions tag space tree structure.

The legal provisions of fig. 4 combine frequency distributions.

Fig. 5 shows the performance comparison of the hierarchical indexes under different neighbor numbers.

Figure 6 compares macro average indicator performance for different neighbor numbers.

FIG. 7 is a comparison of performance of indexes under different weighting strategies.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

The invention discloses a hierarchical multi-label classification method suitable for legal identification, which comprises the following steps:

step 1, crawling a required referee document original text data set from the Internet by using a crawler technology based on jsup, and randomly dividing the referee document original text data set into a training set and a testing set according to a ratio of 7: 3. Then, preprocessing the official document, and mainly completing the following work:

extracting case facts and applicable legal provisions thereof from the case facts according to a literary structure of a referee document, wherein the case facts are used for generating feature vectors of case samples, and the applicable legal provisions are used for representing category labels of the case samples, and converting an original text data set into a semi-structured multi-label training set and a semi-structured testing set;

correcting errors and format inconsistency in case-applicable legal provisions;

and utilizing a language technology platform LTP of the Hadamard to perform word segmentation and part of speech tagging on the case fact description.

And 2, because the organization of the legal provision in the legal system is in a tree structure, correspondingly, the label space formed by the category labels in the multi-label training set is in a tree structure. Based on the hierarchical structure of the label space, expanding legal provisions corresponding to case facts of all samples, and enabling a category label set corresponding to each case fact to be a subset of the label space and meet the hierarchical limitation;

step 3, performing feature selection on the word segmentation result obtained from the original training set in the step 1, and selecting feature words capable of sufficiently representing case facts to construct feature vectors; obtaining a structured extended multi-label training set Tr and a test set Te through text representation;

step 4, constructing a prediction model: finding k neighbor sample sets N (x) of unseen examples x from the extended multi-label test set Te in the extended multi-label training set Tr, setting weight for each neighbor sample, calculating confidence coefficients of the unseen examples belonging to all classes in the label space according to the classification weight of the k neighbor samples to all classes in the label space, predicting class label sets h (x) of the unseen examples, and h (x) meeting the hierarchical constraint. And finally, removing the hierarchical restriction in the prediction type set h (x) (namely the inverse process of label expansion) according to the tree structure of the label space to obtain the specific applicable legal provisions of the unseen examples.

The step 2 comprises the following steps:

And a label space Y containing q classes { Y ═ Y₁,y₂,…,y_q}，y_iRepresenting the ith class, the class label spatial hierarchy can be represented by a binary (Y, <) < representing a partial ordering relationship for the class labels, which can be understood as "belonging to" the relationship, i.e., if there is Y_i,y_jE is Y and Y_i＜y_jThen class y_iBelong to the category y_j，y_iIs y_jDescendant class of, y_jIs y_iThe ancestor class of (1). The partial order relationship < has asymmetry, non-reflexibility and transitivity, and can be described by the following four characteristics:

e) the only root node in the class label hierarchy is represented by a virtual class label R for any y_iE is Y, has Y_i＜R；

f) For any y_i,y_jE.g. Y, if there is Y_i＜y_jThen, then

g) Arbitrary y_iIs e.g. Y, has

h) Arbitrary y_i,y_j,y_k∈Y，y_i＜y_jAnd y is_j＜y_kThen there is y_i＜y_k。

The multi-label classification problem in which the organizational structure of the category labels satisfies the above four features can be regarded as a hierarchical multi-label classification problem. As can be seen from the above formal definitions, in the hierarchical class label space, all other class nodes (excluding the start node) on the unique path formed by tracing back from any class node to the root node are ancestor class nodes of the class node. Thus if the sample has a class label c_iThen the sample implicitly has c_iAll ancestor class labels of, this requires that the classifier also satisfy the hierarchy constraint for the set of prediction classes for unseen instances, h (x), i.e.,

and y '< y' ∈ h (x).

Step 2-2, for any training sample (x)_i,y_i) (i is more than or equal to 1 and less than or equal to m), m is the number of all the obtained referee document samples, x_iE X is dThe feature vector of the dimension(s),

is equal to x_iA corresponding set of category labels. Let the expanded category label set be y_i', then y_i' therein contains y_iAll category labels in (1) and all ancestor category labels thereof. In a formalized way, the method comprises the following steps of,

the label extension process explicitly expresses the hierarchical relationship of the category labels in the category labels of the sample: if a sample is marked as a certain category, then the ancestor categories of the categories are also explicitly assigned to the sample through label expansion; the category label of each sample can be viewed as a subtree of the label space tree, and the top level of each subtree is the root node. It can be seen that if there is y_i，y_jE is Y and Y_i＜y_jFor example, in the k neighbor sample in the extended multi-label training set, there is a class label y_iMust not be less than having a class label y_jThe number of samples of (1). The label expansion is an important step for ensuring that the prediction result of the learning algorithm meets the level limit.

The step 3 comprises the following steps:

and 3-1, the purpose of feature selection is to reduce dimensions of features, and since a general text feature selection algorithm cannot directly process a multi-label data set, multi-label data needs to be converted into single-label data for processing. The conversion method comprises the following steps: for each multi-label sample (x, h), the number of label categories in the label category set h is represented by | h |, and is replaced by | h | new single-label samples (x, y)_i)(1≤i≤|y|，y_iE h), class y for each new sample_iThat is, a category label in the original multi-label sample category label set h, table 1 gives an example of converting a multi-label sample into a single-label sample according to the above strategy.

TABLE 1 Multi-tag sample conversion Process

And 3-2, converting the multi-label case sample into a single-label case sample through the conversion process of the step 3-1, performing feature selection on the word segmentation result obtained from the original training set in the step 1 by using a general feature selection algorithm, and selecting about 100 feature words with the highest distinguishing capability to form a feature space. And (3) representing the case fact part of each case sample by using feature words from the feature space, wherein the attribute value, namely the feature weight, corresponding to each feature word is calculated by adopting a common TF-IDF algorithm. And considering the case fact part of each sample as a document with words segmented, the case fact parts of all samples form a document set. Feature weight tf-idf of jth dimension feature in ith document_ijThe definition is as follows:

wherein, tf_ijRepresentation feature word t_jIn document d_iFrequency of occurrence of idf_jRepresentation feature word t_jInverse document frequency in a document collection, N represents the total number of documents in the document collection, N_jRepresentation feature word t_jFrequency of documents in a document set, i.e. occurrence of a characteristic word t in a document set_jThe denominator is the normalization factor.

And 3-3, performing feature selection on the word segmentation result obtained from the original training set in the step 1, and selecting a certain number of feature words with distinguishing capability to form feature vectors. Commonly used text feature selection methods are mainly based on Document Frequency (DF), Mutual Information (MI), Information Gain (IG), chi-square statistics (χ)²Statistical, CHI) and the like. The feature selection based on the document frequency is too simple, the feature words with the most classified information cannot be selected frequently, and the mutual information has the defect that the mutual information is easily influenced by the marginal probability of the feature words, so that the information is selected by the hierarchical multi-label classification methodAnd carrying out feature selection by using a gain or chi-square statistical algorithm.

wherein, P_r(y_i) Represents a category y_iProbability of occurrence, P_r(t) represents the probability of occurrence of the feature t, P_r(y_iI t) represents the category y on the premise that the feature t appears_iThe probability of occurrence of the event is,

indicating the probability that the feature t does not occur,

indicating class y without the occurrence of feature t_iThe probability of occurrence. And calculating the information gain of each feature word in the document set, wherein the feature words with the information gain value lower than a set threshold value are not included in the feature space.

Step 3-3 can also adopt the chi-square statistical algorithm to carry on the feature selection to the case fact text in the training set: it is assumed that the feature words are not related to class, and if the test value calculated using the CHI distribution deviates more from the threshold, then there is more confidence in negating the original hypothesis, accepting an alternative hypothesis to the original hypothesis: i.e. the characteristic words have a high degree of correlation with the categories. Let A be the number of documents containing a feature word t and belonging to category y, B be the number of documents containing a feature word t but not belonging to category y, C be the number of documents not containing a feature word t but belonging to category y, D be the number of documents not containing a feature word t but not belonging to category y, and N be the total number of documents, then chi-square statistic χ of feature word t and category y²(t, y) is defined as:

characteristic word t and category y are uniqueAt that time, the chi-square statistic is 0, the chi-square statistic about each category is calculated for one feature word, and then the mean χ is calculated respectively² _avg(t) and the maximum value χ² _max(t), comprehensively considering the two modes, and selecting the most distinguishing characteristic words:

χ² _avg(t)＝∑_i＝1P_r(y_i)χ²(t,y_i)，

X² _max(t)＝max_i＝1,...,qX²(t,y_i)。

P_r(y_i) Represents a category y_iThe probability of occurrence. The main advantage of the chi-squared statistical feature selection algorithm over mutual information is that it is a normalized value, and therefore can better weigh different feature words in the same category.

In step 4, when k is found to be close to the neighbor, no example x and no sample (x) are found_i,h_i) Distance d (x, x)_i) The inverse of the cosine similarity of their feature vectors is used for the measurement. The cosine similarity cos (γ, λ) between the feature vector γ of the unseen example and the feature vector λ of the neighboring sample is calculated as follows:

where S denotes the index of the vector component, i.e. the position of the component in the vector, S denotes the vector dimension, γ_sDenotes the s-th component of the vector y, λ_sRepresenting the s-th component of the vector lambda.

In step 4, d (x, x) is used_i) Show example x and sample (x)_i,h_i) Using full label distance weight method to calculate sample ((x)_i，h_i) e.N (x)) for class y_jBy classification weight w_ij：

Calculation of w by full-label distance weight method_ij：

Calculation of w by entropy label distance weight method_ij：

Unseen examples belong to category y_jConfidence of (c) (x, y)_j) The calculation formula is as follows:

and selecting 0.5 as a decision threshold, and returning the class with the highest confidence as the class to which the unseen instance belongs when the confidence of the unseen instance belonging to each class is less than the decision threshold.

Examples

As shown in fig. 1, the steps of the present invention are:

step one, crawling a required referee document original text data set from the Internet by using a crawler technology based on a jsup, and randomly dividing the referee document original text data set into a training set and a testing set according to a ratio of 7: 3. Then, preprocessing the official document, and mainly completing the following work:

correcting errors and format inconsistency in case-applicable legal provisions;

Expanding legal provisions corresponding to case facts of all samples based on a hierarchical structure of a label space, so that a category label corresponding to each case fact is a subset of the label space and meets the hierarchical limitation;

step three, performing feature selection on the word segmentation result obtained from the original training set in the step 1, and selecting feature words capable of sufficiently representing case facts to construct feature vectors; obtaining a structured extended multi-label training set Tr and a test set Te through text representation;

step four, constructing a prediction model: firstly, finding out k neighbor sample sets N (x) of unseen examples x from an extended multi-label test set Te in an extended multi-label training set Tr, setting weight for each neighbor sample, calculating confidence coefficients of the unseen examples belonging to each category in a label space according to the classification weight of the k neighbor samples to each category in the label space, predicting category label sets h (x) of the unseen examples, and h (x) meeting the hierarchical constraint. And finally, removing the hierarchical restriction in the prediction type set h (x) (namely the inverse process of label expansion) according to the tree structure of the label space to obtain the specific applicable legal provisions of the unseen examples.

The implementation data is obtained from official documents of people's court at all levels of Zhejiang province published by Zhejiang court.

FIG. 2 is a sample official document in which the straight underline marked portion is the case fact portion and the curved underline marked portion is the applicable legal provisions for the case. And extracting case facts and legal provisions thereof according to the law of the official document. The pretreatment work is mainly the cleaning and the correction of the applicable legal part of the case.

In fig. 3, a tree structure of a legal provision tag space is shown. Based on the hierarchical structure, the legal provision corresponding to each case fact is subjected to label expansion.

Fig. 4 is a legal provision combination histogram. According to the frequency of citation of each legal provision, 26 laws such as ' people's republic of China ' litigation law, ' people's republic of China ' and ' 451 specific legal provisions contained in the laws are selected as category labels to form a label space, namely, the dimension of the label space is 477. The set of category labels for each case sample is represented in the form of a label vector, each dimension of which represents a category label in the label space, i.e., a complete legal provision. If a case is applicable to a certain legal provision, the corresponding label entry values of the legal provision and all legal provisions containing the legal provision in the label vector are both 1, otherwise, the corresponding label entry values are 0. Therefore, the label vector of each sample corresponds to one legal provision combination, the frequency of occurrence of each combination is the number of corresponding case samples, and the frequency of occurrence of each legal provision combination can also reflect some properties of the case sample set. By calculating each, and selecting the combination with higher frequency of occurrence and arranging it in order from large to small, fig. 4 can be obtained. As can be seen from the figure, the occurrence frequency of the legal provision combinations is approximately in a long tail distribution, the occurrence frequency of a few legal provision combinations is extremely high, which indicates that a large number of case samples are suitable for the legal provision combinations, and in addition, the occurrence frequency of most legal provision combinations is relatively balanced.

And step three, selecting an information gain algorithm to select characteristics. Through calculating the information gain of each feature word, it can be found that most words with higher information gain are verbs or nouns, and table 2 shows the proportion of verbs and nouns in the feature words with the highest information gain value, so that the nouns and verbs have higher distinguishing capability in the problem of legal identification compared with words with other properties, and on the other hand, the words except the verbs and nouns in the text can be removed through part-of-speech tagging, so that the number of words in the text is reduced, and the subsequent calculation is simplified.

Table 2 ratio of verb nouns in feature words:

number of feature words	Verb noun number ratio	Verb noun information gain total proportion
			100	88.0％	87.9％
200	80.0％	82.3％
			300	81.0％	82.5％
400	80.5％	82.0％
			500	76.8％	79.7％

Table 3 summary of experimental training set and test set:

	number of samples	Sample average class label number
			Training set	102608	7.6344
Test set	44210	7.6397

Fig. 5 and 6 are comparisons of the performance of the hierarchical index and the macro-average index when different numbers of neighbors are taken.

As can be seen from fig. 5: when the number of the neighbors is an even number, the precision of the algorithm is high, and the recall rate is low; when the number of the neighbors is odd, the precision of the algorithm is low, and the recall rate is high. This distinction becomes progressively smaller as the number of neighbors increases. This phenomenon can be explained by analyzing the principles of the algorithm: the decision threshold set by the algorithm is 0.5, and when the number of neighbors is even, only the class label with the occurrence frequency exceeding k-2 is predicted as the class label of the unseen instance due to the addition of the smoothing parameter, and the class label with the occurrence frequency just equal to k-2 is not endowed with the unseen instance. Therefore, when the number of neighbors is even, the condition that each class label endows the unseen instance is more severe, so that the prediction precision of the algorithm is higher, and the recall rate is correspondingly lower. This effect is gradually reduced as the number of neighbors increases, and thus the difference becomes smaller. It can also be seen from the figure that when the target category is all legal provisions, each prediction index of the algorithm is higher than that when the target category is a specific legal provision. This is because broader legal categories contain more case samples, thus making the model more predictive in these categories. In summary, when the k value of the number of neighbors is 5, the comprehensive prediction performance of the algorithm is the best.

From fig. 6 it can be found that: as the number of neighbors increases, the macro-average precision, recall ratio and F metric value of the algorithm all decrease. The reason for this may be that as the number of neighbors increases, it is more difficult for the classes with fewer samples to reach the decision threshold, thus leading to a decrease in the prediction performance of most classes and ultimately to a decrease in the corresponding macro average performance.

Fig. 7 shows the performance of the algorithm on each evaluation index when the number of the fixed neighbors is 5 and the sample weight strategy is a full label distance weight method and an entropy label distance weight method, respectively. In summary, regardless of hierarchical indexes or macro-average indexes, the entropy label distance weighting strategy can achieve better effect on precision, and the full label distance weighting strategy can achieve better effect on recall rate and F metric value. The entropy label weight strategy is biased to samples with fewer class labels, and in the expanded hierarchical multi-label samples, the more specific the class to which the sample belongs, the more class labels, the smaller the classification weight under the entropy label weight strategy, so that the prediction result is more biased to the upper class by adopting the entropy label weight strategy, and the larger the generalization error. Although the algorithm has a decline in performance when the target category is a specific legal provision, there is still a hierarchical accuracy close to 80% and a hierarchical recall rate of more than 65%, indicating that case-applicable legal identification based on the present hierarchical multi-label classification algorithm is valid.

In consideration of the two cases that the target category is all legal provisions and specific legal provisions, the macro average precision, recall rate and F metric value of the algorithm when the target category is all legal provisions are respectively represented by mP _ all, mP _ all and mP _ all, and the macro average precision, recall rate and F metric value of the algorithm when the target category is specific legal provisions are represented by mP _ partial, mP _ partial and mP _ partial.

Two common hierarchical multi-label classification algorithms, namely a TreeBoost.MH local algorithm and a Clus-HMC global algorithm, are selected respectively in the implementation and are compared with the prediction performance of the hierarchical multi-label classification algorithm, the performance comparison of the hierarchical multi-label classification algorithm on each hierarchical index is given in a table 5, and the prediction performance comparison of the hierarchical multi-label classification algorithm on each macro-average index is given in a table 6.

Table 5 comparison of hierarchical index performance of each algorithm:

table 6 macro-average performance comparison of algorithms:

the fact proves that the multi-label classification algorithm of the level can achieve better prediction performance than the existing method. By combining the characteristic that the Lazy-HMC algorithm supports incremental learning, an effective and applicable automatic case law identification system can be constructed by utilizing the Lazy-HMC algorithm.

The present invention provides a hierarchical multi-label classification method suitable for legal identification, and a plurality of methods and ways for implementing the technical scheme, and the above description is only a preferred embodiment of the present invention, it should be noted that, for those skilled in the art, a plurality of improvements and decorations can be made without departing from the principle of the present invention, and these improvements and decorations should also be regarded as the protection scope of the present invention. All the components not specified in the present embodiment can be realized by the prior art.

Claims

1. A hierarchical multi-label classification method suitable for legal identification is characterized by comprising the following steps:

step 1, acquiring an original text data set of a referee document, dividing the original text data set into a training sample set and a testing sample set, and preprocessing: extracting case facts and applicable legal provisions thereof from the case facts according to a literary structure of a referee document, wherein the case facts are used for generating feature vectors of case samples, the case samples comprise training samples and testing samples, the applicable legal provisions are used for representing class labels of the case samples, and an original text data set is converted into a semi-structured multi-label training sample set and a multi-label testing sample set; performing word segmentation and part-of-speech tagging on case fact description;

step 2, expanding legal provisions corresponding to case facts of all case samples based on a hierarchical structure of a label space formed by category labels in a multi-label training sample set, so that the category label corresponding to each case fact is a subset of the label space and meets the hierarchical limitation;

step 3, performing feature selection on the word segmentation result in the step 1, and selecting feature words capable of representing case facts to construct feature vectors; obtaining a structured extended multi-label training sample set Tr and an extended multi-label testing sample set Te through text representation;

step 4, constructing a prediction model: finding k neighbor sample sets N' (x) of unseen examples x from an extended multi-label test sample set Te in an extended multi-label training sample set Tr, wherein the unseen examples are the case facts to be classified, setting weight for each neighbor sample, calculating confidence coefficients of the unseen examples x belonging to each class according to the classification weight of the k neighbor samples to each class, predicting class label sets h (x) of the unseen examples x, wherein h (x) meets the hierarchical constraint, and finally removing the hierarchical constraint in the class label sets h (x) of the unseen examples x according to the tree structure of a label space to obtain the specific applicable legal provisions of the unseen examples;

in the step 1, randomly dividing an original text data set of a referee document into a training sample set and a testing sample set according to the proportion of 7: 3;

the step 2 comprises the following steps:

Is a real number set and contains q classes of label space Y ═ Y₁,y₂,…,y_q}，y_vThe label of the v-th class is represented, and v is more than or equal to 1 and less than or equal to q, then the class label space hierarchy uses the binary group

It is shown that,

shows the partial order relationship of the category label if y exists_v,y_uIs epsilon of Y and

then the category label y_vBelongs to category label y_u，y_vIs y_uDescendant class tag of, y_uIs y_vThe classifier must satisfy the hierarchical constraint on the set of class labels h (x) predicted to miss instance x, i.e.,

step 2-2, for each multi-labeled case sample (x)_i,h_i) I is more than or equal to 1 and less than or equal to m, m is the number of all the obtained referee document samples, x_ie.X' is a feature vector with d dimension for representing the fact part of the case,is equal to x_iA corresponding set of class labels, i.e. x_iCorresponding legal provision, the expanded category label set is h_i′，

Then h is_iIn which h is included_iAll class labels in (1) and all ancestor class labels thereof:

wherein y' is h_iThe ancestor class label of class label y in (1), y ∈ h_i；

The step 3 comprises the following steps:

step 3-1, converting multi-label sample data intoProcessing the single-label sample data: for each multi-labeled case sample (x)_i,h_i) (i is more than or equal to 1 and less than or equal to m) using | h_iI represents a category label set h of a multi-labeled case sample_iThe number of label categories in the label list is replaced by | h_iL single-label case sample (x)_i′,y_i′)(1≤i′≤|h_i|,y_i′∈h_i) Class label y for each single label sample_i′I.e. the category label set h_iA category label of; the multi-label case samples comprise multi-label training samples and multi-label testing samples;

step 3-2, through the conversion process of step 3-1, the multi-label case samples are converted into a plurality of single-label case samples, the case fact part of each single-label case sample is regarded as a document with words already segmented, the case fact parts of all single-label case samples form a document set, and the feature weight tf-idf of the jth 'dimension feature in the ith' document in the document set_i″j″The definition is as follows:

3-3, performing feature selection on the word segmentation result in the step 1 by using an information gain algorithm or a chi-square statistical algorithm, and selecting a certain number of feature words with distinguishing capability to form a feature space;

and (3) selecting features by adopting an information gain algorithm: the information gain ig (t) of the feature word t is defined as follows:

wherein, p (y)_v) Presentation category label y_vProbability of occurrence, p (t) represents probability of occurrence of feature word t, p (y)_vI t) indicates the category label y on the premise that the feature word t appears_vThe probability of occurrence of the event is,

indicates the probability that the feature word t does not occur,

indicating a category label y on the premise that the feature word t does not appear_vCalculating the information gain of each feature word in the document set according to the occurrence probability, wherein the feature words with the information gain value lower than a set threshold value are not included in a feature space;

and (3) selecting features by adopting a chi-square statistical algorithm:

let A be a label y containing a feature word t and belonging to a category_v1 < v < q, B is a label y containing a feature word t but not belonging to a category_vC is a label y belonging to a category without including a feature word t_vD is a label y which does not contain the feature word t and does not belong to the category_vN is the total number of documents in the document set, then the characteristic word t and the category label y_vChi square statistic of²(t,y_v) Is defined as:

By Chi² _avg(t)And

comprehensively considering, selecting a certain number of characteristic words with distinguishing capability, wherein p (y)_v) Presentation category label y_vProbability of occurrence:

in step 4, when k neighbors are found, no example x and no neighbor sample (x) in the extended multi-label training sample set are found_a,h_a) Distance d (x, x)_a) Wherein (x)_a,h_a)∈N′(x)，1≤a≤k，h_aIs x_aAnd calculating the reciprocal of cosine similarity of the feature vectors of the corresponding class labels, wherein the cosine similarity cos (gamma, lambda) of the feature vector gamma of the example x and the feature vector lambda of the adjacent sample is calculated by the following formula:

2. The method of claim 1, wherein: in step 4, d (x, x) is used_a) Indicating that example x is not found and the neighboring samples (x) in the extended multi-labeled training sample set_a,h_a) The distance of the extended multi-label training sample set is calculated by adopting a full label distance weight method or an entropy label distance weight method to h_aClass label y in (1)_jBy classification weight w_aj，1≤j≤q；

Calculation of w by full-label distance weight method_aj：

Calculation of w by entropy label distance weight method_aj：

wherein w_arRepresents h_aR class label y_rThe classification weight of (2);

the class label set h (x) for the prediction of missing instance x is: