CN107577785B - Hierarchical multi-label classification method suitable for legal identification - Google Patents


Info

Publication number
CN107577785B
Authority
CN
China
Prior art keywords
label
class
feature
category
case
Prior art date
Legal status
Active
Application number
CN201710832304.8A
Other languages
Chinese (zh)
Other versions
CN107577785A (en)
Inventor
柏文阳
陈朋薇
张剡
周嵩
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201710832304.8A priority Critical patent/CN107577785B/en
Publication of CN107577785A publication Critical patent/CN107577785A/en
Application granted granted Critical
Publication of CN107577785B publication Critical patent/CN107577785B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hierarchical multi-label classification method suitable for legal identification, which comprises the following steps: step 1, extracting case facts and their legal provisions from preprocessed referee documents; step 2, expanding the legal provisions corresponding to the case facts based on the hierarchical structure of the label space, so that the category label set of each case sample is a subset of the label space; step 3, performing word segmentation and part-of-speech tagging on the case fact texts, performing feature selection on the segmentation results, and selecting feature words that can sufficiently represent the case facts to construct feature vectors; step 4, constructing the prediction model: finding the k-nearest-neighbor sample set N(x) of an unseen example x in the extended multi-label training set, setting a weight for each neighbor sample, calculating the confidence that the unseen example belongs to each class according to the classification weights of the k neighbor samples for each class, and finally predicting the class label set of the unseen example.

Description

Hierarchical multi-label classification method suitable for legal identification
Technical Field
The invention belongs to the field of computer data analysis and mining, and relates to a hierarchical multi-label classification method suitable for legal identification.
Background
Hierarchical multi-label classification is a special case of multi-label classification. Unlike general multi-label classification, in a hierarchical multi-label classification problem each sample can have multiple class labels while the sample label space is organized in a tree or directed-acyclic-graph hierarchy. In a directed acyclic graph, one node may have several parent nodes, which is more complex than a tree structure and makes algorithm design more difficult, so current research on hierarchical multi-label classification mainly targets tree-structured class labels. According to how an algorithm views the category hierarchy, hierarchical multi-label classification algorithms can be divided into local algorithms and global algorithms.
The local algorithm examines the local classification information of each internal node in the category hierarchy one by one, converting the hierarchical multi-label classification problem into a number of multi-label classification problems. When training the multi-label classifier on an internal node, an appropriate local sample set must be selected, and in the prediction stage a top-down prediction mode is adopted so that the prediction result satisfies the hierarchy requirement. The document ESULI A, FAGNI T, SEBASTIANI F. TreeBoost.MH: a boosting algorithm for multi-label hierarchical text categorization [C] // String Processing and Information Retrieval, 2006: 13-24 proposes the TreeBoost.MH algorithm to handle the hierarchical multi-label text classification problem. The algorithm recursively trains multi-label classifiers on each non-leaf node in the class label tree, with AdaBoost.MH as the base classifier. Experiments show that the TreeBoost.MH algorithm is better than the AdaBoost.MH algorithm in both time efficiency and prediction performance. The document CERRI R, BARROS R C, DE CARVALHO A C. Hierarchical multi-label classification using local neural networks [J]. Journal of Computer and System Sciences, 2014, 80(1): 39-56 proposes a local hierarchical multi-label classification algorithm based on multi-layer perceptrons: a multi-layer perceptron network is trained at each level of the category hierarchy, each neural network is associated with one level and predicts the category labels at that level, and the prediction result of the network at one level is used as the input of the network at the next level. Because each level's network is trained on the same sample set, the prediction result may not satisfy the hierarchy constraint, so the prediction result requires post-processing to ensure that it satisfies the hierarchy constraint.
The local algorithm has two disadvantages. On one hand, many classifiers must be trained, so the model is relatively complex, which harms the model's understandability. On the other hand, a blocking problem occurs during prediction, i.e., samples misclassified at an upper level can never reach the classifiers at lower levels; although three strategies (threshold reduction, restricted voting, and extended multiplicative thresholds) have been proposed to alleviate the blocking problem, the local algorithm remains unsatisfactory in prediction accuracy.
The global algorithm considers the category hierarchy as a whole, training a single multi-label classifier and using it to predict unseen instances. Global algorithms can be classified as follows according to the way they process the class label hierarchy. One kind of global algorithm uses class clustering: it first computes the similarity of the test sample to each class and then assigns the test sample to the closest class. Another kind converts the hierarchical multi-label classification problem into a multi-label classification problem: the document KIRITCHENKO S, MATWIN S, FAMILI A F. Functional annotation of genes using hierarchical text categorization [C], 2005 extends the class labels of training samples with their ancestor class labels, converting the hierarchical multi-label classification problem into a multi-label classification problem. In the testing stage, the adopted multi-label classification algorithm AdaBoost.MH does not consider the category hierarchy, so it faces the same problem as the local algorithm: the predicted result can be hierarchically inconsistent, and the output of the model must be corrected to satisfy the hierarchy constraint. A third kind of global algorithm adapts existing non-hierarchical classification algorithms to process the hierarchical information directly and uses it to improve performance. The document VENS C, STRUYF J, SCHIETGAT L, et al. Decision trees for hierarchical multi-label classification [J]. Machine Learning, 2008, 73(2): 185-214 adapts decision trees to hierarchical multi-label classification, giving the global Clus-HMC algorithm together with the Clus-SC and Clus-HSC algorithms for comparison. Experimental results show that the global Clus-HMC algorithm is better than the Clus-SC and Clus-HSC algorithms in prediction performance and also has better time efficiency.
In general, global algorithms have two features: they consider the category hierarchy as a whole, in one pass, and they lack the modularity characteristic of local algorithms. The key difference between global and local algorithms lies in the training process; in the testing stage, a global algorithm can even use a top-down mode, like a local algorithm, to predict the categories of unseen instances.
Since the class labels in the hierarchical multi-label classification problem are organized hierarchically, if a sample has a class label c_i, then the sample implicitly has all ancestor class labels of c_i. On the other hand, when predicting the category of an unseen instance, the hierarchy constraint must also be satisfied: it cannot happen that the unseen instance belongs to a category but not to an ancestor category of that category. A general hierarchical multi-label classification algorithm often cannot ensure that the prediction result satisfies the hierarchy constraint, or fails to obtain the best learning effect because it does not exploit the hierarchical structure of the label space. Therefore, a hierarchical multi-label classification algorithm must not only make full use of the associations and the hierarchical structure between class labels to improve the prediction performance of the classification model, but must also make the prediction result satisfy the hierarchy constraint.
The problem of automatically identifying the law applicable to a case is essentially a hierarchical multi-label classification problem: the category labels of the samples, namely the legal provisions applicable to the cases, are organized in a tree structure; one case may be applicable to several legal provisions; and the legal provisions applicable to a case may differ in how specific they are. A hierarchical multi-label classification algorithm for automatic identification of case-applicable law must therefore be able to process tree-shaped category hierarchies, and it must be a non-mandatory leaf node prediction algorithm, i.e., predicted class labels may correspond to any node in the category hierarchy.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, the invention solves the technical problem of providing an effective hierarchical multi-label classification method suitable for legal identification.
The technical scheme is as follows: the invention discloses a hierarchical multi-label classification method suitable for legal identification, which comprises the following steps:
Step 1: crawl the required referee document original text data set from the Internet using a Jsoup-based crawler, with one referee document corresponding to one sample, and randomly divide the documents into a training sample set and a testing sample set in a 7:3 ratio. Then preprocess the referee documents: extract the case facts and the applicable legal provisions from each document according to the literary structure of referee documents, where the case facts are used to generate the feature vectors of case samples (case samples comprise training samples and test samples) and the applicable legal provisions are used to represent the class labels of case samples; convert the original text data set into a semi-structured multi-label training sample set and a semi-structured test sample set, each semi-structured sample having the form (case fact description, legal provisions text); correct errors and format inconsistencies in the case-applicable legal provisions; and perform word segmentation and part-of-speech tagging on the case fact descriptions using the Language Technology Platform (LTP) of Harbin Institute of Technology as the language processing tool. LTP is a complete Chinese language processing system: it defines an XML-based representation of language processing results and, on that basis, provides a rich and efficient set of bottom-up Chinese language processing modules (covering six core Chinese processing technologies including lexical, syntactic and semantic analysis), application program interfaces based on a dynamic link library (DLL), and visualization tools, and can be used as a network service.
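To make the preprocessing of step 1 concrete, the following is a minimal Python sketch; the marker phrases in the regular expressions, the function name, and the document layout it assumes are illustrative assumptions rather than part of the patent, and a real implementation would follow the actual literary structure of the referee documents (the Jsoup crawling itself is omitted).

```python
import random
import re

def preprocess(raw_documents):
    """Split raw judgment texts into (case_fact, provisions) samples and
    make a 7:3 train/test split, as described in step 1."""
    samples = []
    for doc in raw_documents:
        # Assumed section markers; real documents need patterns matching
        # their actual structure.
        fact = re.search(r"经审理查明(.*?)本院认为", doc, re.S)
        # Assumed citation pattern: 《law name》第...条
        provisions = re.findall(r"《[^》]+》第[^条]+条", doc)
        if fact and provisions:
            samples.append((fact.group(1).strip(), provisions))
    random.shuffle(samples)
    cut = int(len(samples) * 0.7)   # 7:3 ratio
    return samples[:cut], samples[cut:]
```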
Step 2: because legal provisions are organized in the legal system in a tree structure, the label space formed by the category labels in the multi-label training set is correspondingly a tree structure. Based on the hierarchical structure of the label space formed by the category labels in the multi-label training sample set, expand the legal provisions corresponding to the case facts of all case samples, so that the category label set corresponding to each case fact is a subset of the label space and satisfies the hierarchy constraint;
Step 3: perform feature selection on the word segmentation results from the training set of step 1 (the word segmentation results refer to the case fact part of the semi-structured multi-label training set of step 1), selecting feature words that can sufficiently represent the case facts to construct feature vectors; obtain, through text representation, a structured extended multi-label training sample set Tr and an extended multi-label testing sample set Te;
Step 4: construct the prediction model. Find the k-nearest-neighbor sample set N(x) in the extended multi-label training sample set Tr for an unseen example x from the extended multi-label test sample set Te, set a weight for each neighbor sample, calculate the confidence that the unseen example x belongs to each category in the label space according to the classification weights of the k neighbor samples for each category in the label space, and predict the class label set h(x) of the unseen example x, where h(x) satisfies the hierarchy constraint. Finally, remove the hierarchical restriction in the predicted class label set h(x) of the unseen example x according to the tree structure of the label space (i.e., the inverse process of label expansion) to obtain the specific applicable legal provisions of the unseen example.
The step 2 comprises the following steps:
Step 2-1: in the hierarchical multi-label classification problem, a d-dimensional instance space X ⊆ R^d is given (R is the real number set), together with a label space Y = {y_1, y_2, …, y_q} containing q classes, where y_v denotes the v-th class label, 1 ≤ v ≤ q. The hierarchy of the class label space can be represented by a pair (Y, <), where < denotes the partial order relation on class labels and can be understood as the "belongs to" relation: if there exist y_v, y_u ∈ Y with y_v < y_u, then class label y_v belongs to class label y_u, y_v is a descendant class label of y_u, and y_u is an ancestor class label of y_v. The partial order relation < is asymmetric, non-reflexive and transitive, and can be described by the following four properties:
a) the unique root node of the class label hierarchy is represented by a virtual class label R: for any y_i ∈ Y, y_i < R;
b) for any y_i, y_j ∈ Y, if y_i < y_j, then not y_j < y_i;
c) for any y_i ∈ Y, not y_i < y_i;
d) for any y_i, y_j, y_k ∈ Y, if y_i < y_j and y_j < y_k, then y_i < y_k.
A multi-label classification problem in which the organizational structure of the class labels satisfies the above four properties can be regarded as a hierarchical multi-label classification problem. From this formal definition, in a hierarchical class label space, all other class nodes (excluding the starting node) on the unique path traced from any class node back to the root node are ancestor class nodes of that class node. Thus, if a sample has a class label y_i, the sample implicitly also has all ancestor class labels of y_i. This requires that the class label set h(x) predicted by the classifier for an unseen instance x satisfy the hierarchy constraint, i.e.,

    for all y'' ∈ h(x) with y'' < y', it holds that y' ∈ h(x),

where y'' is a class label in h(x) and y' is an ancestor class label of y''.
Step 2-2: for each multi-label case sample (x_i, h_i), 1 ≤ i ≤ m, where m is the number of referee document samples obtained, x_i ∈ X is a d-dimensional feature vector representing the case fact part, and h_i ⊆ Y is the class label set corresponding to x_i, i.e., the legal provisions applicable to x_i. Let the expanded class label set be h_i'; then h_i' contains all class labels in h_i together with all of their ancestor class labels. Formally,

    h_i' = h_i ∪ { y' ∈ Y : y < y' for some y ∈ h_i },

where y' is an ancestor class label of a class label y ∈ h_i.
The label expansion process makes the hierarchical relations among class labels explicit in the class labels of each sample: if a sample is marked with a certain category, the ancestor categories of that category are also explicitly assigned to the sample through label expansion. The class label set of each sample can then be viewed as a subtree of the label space tree, with the root node at the top of every subtree. It follows that if y_i, y_j ∈ Y and y_i < y_j, then among the k nearest neighbors in the extended multi-label training set, the number of samples having class label y_j is not less than the number of samples having class label y_i. Label expansion is a key step in ensuring that the prediction result of the learning algorithm satisfies the hierarchy constraint.
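A minimal sketch of this label expansion, assuming the label tree is given as a child-to-parent map (`parent`, a name introduced here for illustration; the virtual root has no entry):

```python
def expand_labels(labels, parent):
    """Close a label set upward so it satisfies the hierarchy constraint."""
    expanded = set(labels)
    for y in labels:
        while y in parent:          # walk up to the root, adding ancestors
            y = parent[y]
            expanded.add(y)
    return expanded

# e.g. with parent = {"Art.107": "Contract Law", "Contract Law": "R"},
# expand_labels({"Art.107"}, parent) == {"Art.107", "Contract Law", "R"}
```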
The step 3 comprises the following steps:
Step 3-1: the purpose of feature selection is dimensionality reduction of the features. Since common text feature selection algorithms cannot directly process a multi-label data set, the multi-label sample data must be converted into single-label sample data. The conversion method is as follows: each multi-label case sample (x_i, h_i), 1 ≤ i ≤ m, where |h_i| denotes the number of class labels in the class label set h_i, is replaced by |h_i| single-label case samples (x_i', y_i'), 1 ≤ i' ≤ |h_i|, y_i' ∈ h_i; the class label y_i' of each single-label sample is one class label from the class label set h_i. The multi-label case samples comprise multi-label training samples and multi-label testing samples. Table 1 gives an example of converting a multi-label sample into single-label samples according to this strategy (a code sketch of the conversion follows the table).
TABLE 1 Multi-label sample conversion process

    Multi-label sample        Converted single-label samples
    (x, {y1, y2, y3})         (x, y1), (x, y2), (x, y3)
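A one-line sketch of this conversion (the function name is an assumption):

```python
def to_single_label(samples):
    """Replace each (x, h) by |h| single-label samples (x, y), y in h."""
    return [(x, y) for x, h in samples for y in h]

# [("facts...", {"y1", "y2"})] -> [("facts...", "y1"), ("facts...", "y2")]
```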
Step 3-2: through the conversion process of step 3-1, each multi-label case sample is converted into several single-label case samples. Perform feature selection on the word segmentation results obtained from the original training set in step 1 using a general feature selection algorithm, and select a certain number of discriminative feature words to form the feature space (usually the total information gain of the selected feature words should be as large as possible while their number is kept moderate; for example, when using the information gain algorithm, at least 100 feature words are generally selected). The case fact part of each case sample is then represented with feature words from the feature space. The attribute value corresponding to each feature word, i.e., the feature weight, is calculated with the commonly used TF-IDF algorithm. Regard the case fact part of each single-label case sample as a segmented document, so that the case fact parts of all single-label case samples form a document set. The feature weight tf-idf_ij of the j-th dimension feature of the i-th document in the document set is defined as

    tf-idf_ij = tf_ij × idf_j = tf_ij × log(N / n_j) / (normalization factor),

where tf_ij denotes the frequency of feature word t_j in document d_i, idf_j denotes the inverse document frequency of feature word t_j in the document set, N denotes the total number of documents in the document set, and n_j denotes the document frequency of feature word t_j in the document set, i.e., the number of documents in which t_j occurs; the denominator is a normalization factor.
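A sketch of this TF-IDF weighting, assuming cosine normalization as the normalization factor (the exact normalization is not spelled out in the text; all names are assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """docs: list of token lists; vocab: the selected feature words."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc) if w in vocab)
    vectors = []
    for doc in docs:
        tf = Counter(w for w in doc if w in vocab)
        raw = [tf[w] * math.log(n_docs / df[w]) if df[w] else 0.0
               for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])   # unit-length vector
    return vectors
```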
and 3-3, performing feature selection on the word segmentation result in the step 1 by using an information gain algorithm or a chi-square statistical algorithm, and selecting about 100 feature words with the highest distinguishing capability to form a feature vector. Commonly used text feature selection methods are mainly based on Document Frequency (DF), Mutual Information (MI), Information Gain (IG), chi-square statistics (χ)2Statistical, CHI) and the like. The feature selection based on the document frequency is too simple, the feature words with the most classified information cannot be selected, and the mutual information has the defect that the mutual information is easily influenced by the marginal probability of the feature words, so that the hierarchical multi-label classification method selects information gain or chi-square statistical algorithm to select the features.
Step 3-3 comprises feature selection with the information gain algorithm. The information gain IG(t) of a feature word t is defined as

    IG(t) = − Σ_{v=1}^{q} p(y_v) log p(y_v)
            + p(t) Σ_{v=1}^{q} p(y_v | t) log p(y_v | t)
            + p(t̄) Σ_{v=1}^{q} p(y_v | t̄) log p(y_v | t̄),

where p(y_v) denotes the probability that class label y_v occurs, p(t) denotes the probability that feature word t occurs, p(y_v | t) denotes the probability that class label y_v occurs given that feature word t occurs, p(t̄) denotes the probability that feature word t does not occur, and p(y_v | t̄) denotes the probability that class label y_v occurs given that feature word t does not occur. The information gain is calculated for every feature word in the document set, and feature words whose information gain falls below a set threshold (for example 0.15; the threshold is set so that the total information gain of the selected feature words is as large as possible while their number does not become too large) are not included in the feature space;
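The information gain above is the entropy of the class distribution minus its conditional entropy given the presence or absence of t. A minimal sketch on the single-label document set (names are assumptions):

```python
import math

def information_gain(docs, labels, t, classes):
    """docs: token sets; labels: class label per doc; t: feature word."""
    def entropy(subset):
        n = len(subset)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n)
                    for y in classes
                    if (c := sum(1 for l in subset if l == y)))
    with_t = [l for d, l in zip(docs, labels) if t in d]
    without_t = [l for d, l in zip(docs, labels) if t not in d]
    n = len(docs)
    return (entropy(labels)
            - len(with_t) / n * entropy(with_t)
            - len(without_t) / n * entropy(without_t))
```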
step 3-3 can also adopt chi-square statistical algorithm to carry out feature selection: it is assumed that the feature words are not related to class, and if the test value calculated using the CHI distribution deviates more from the threshold, then there is more confidence in negating the original hypothesis, accepting an alternative hypothesis to the original hypothesis: i.e. the characteristic words have a high degree of correlation with the categories.
Let A be the number of documents that contain feature word t and belong to class label y_v (1 ≤ v ≤ q), B the number of documents that contain feature word t but do not belong to class label y_v, C the number of documents that do not contain feature word t but belong to class label y_v, D the number of documents that neither contain feature word t nor belong to class label y_v, and N the total number of documents in the document set. The chi-square statistic χ²(t, y_v) of feature word t and class label y_v is then defined as

    χ²(t, y_v) = N (AD − CB)² / [ (A + C)(B + D)(A + B)(C + D) ].

When feature word t and class y_v are independent, the chi-square statistic is 0. For each feature word, the chi-square statistic with respect to every class is calculated, and then the mean and the maximum are computed respectively:

    χ²_avg(t) = Σ_{v=1}^{q} p(y_v) χ²(t, y_v),
    χ²_max(t) = max_{1≤v≤q} χ²(t, y_v).

Considering χ²_avg(t) and χ²_max(t) together, a certain number (roughly 100) of discriminative feature words are selected, where p(y_v) denotes the probability that class label y_v occurs.
the main advantage of the chi-squared statistical feature selection algorithm over mutual information is that it is a normalized value, and therefore can better scale different feature words in the same category.
In step 4, when finding the k nearest neighbors, the distance d(x, x_a) between an unseen example x and a neighbor sample (x_a, h_a) in the extended multi-label training sample set, where (x_a, h_a) ∈ N'(x), 1 ≤ a ≤ k, and h_a is the class label set corresponding to x_a, is measured by the reciprocal of the cosine similarity of their feature vectors. The cosine similarity cos(γ, λ) of the feature vector γ of example x and the feature vector λ of the neighbor sample is calculated as

    cos(γ, λ) = Σ_{s=1}^{S} γ_s λ_s / ( sqrt(Σ_{s=1}^{S} γ_s²) · sqrt(Σ_{s=1}^{S} λ_s²) ),

where s denotes the index of a component of the feature vector, i.e., the position of the component in the feature vector, S denotes the dimension of the feature vector, γ_s denotes the s-th component of feature vector γ, and λ_s denotes the s-th component of feature vector λ.
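A sketch of this distance, i.e., the reciprocal of cosine similarity (names are assumptions):

```python
import math

def cosine(g, l):
    dot = sum(a * b for a, b in zip(g, l))
    norm = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in l))
    return dot / norm if norm else 0.0

def distance(g, l):
    sim = cosine(g, l)
    return 1.0 / sim if sim > 0 else float("inf")   # orthogonal -> infinitely far
```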
In step 4, d(x, x_a) denotes the distance between the unseen example x and a neighbor sample (x_a, h_a) in the extended multi-label training sample set. The classification weight w_aj of class label y_j in h_a, 1 ≤ j ≤ q, is calculated with either the full-label distance weight method or the entropy-label distance weight method (both weight formulas are given as images in the original publication: the full-label variant derives w_aj from the inverse distance 1/d(x, x_a) for each class label y_j ∈ h_a, and the entropy-label variant additionally down-weights neighbor samples that carry many class labels).
The confidence c(x, y_j) that the unseen instance x belongs to class label y_j is calculated as

    c(x, y_j) = ( 1 + Σ_{a=1}^{k} w_aj ) / ( 2 + Σ_{a=1}^{k} max_{1≤r≤q} w_ar ),

where w_ar denotes the classification weight of class label y_r of h_a and the added constants are smoothing parameters.

The class label set h(x) predicted for the unseen instance x is

    h(x) = { y_j ∈ Y : c(x, y_j) > 0.5, 1 ≤ j ≤ q }.

0.5 is selected as the decision threshold; when the confidence of the unseen example x is below the decision threshold for every class label, the class label with the maximum confidence is returned as the class label of the unseen example.
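Putting step 4 together, the sketch below implements the k-nearest-neighbor prediction, reusing `distance` from the sketch above. Because the original weight and confidence formulas survive only as images, the full-label weight (inverse distance per carried label) and the smoothed confidence used here are assumed reconstructions, not the patent's verbatim formulas.

```python
def predict(x, train, k, label_space, threshold=0.5):
    """train: list of (vector, expanded_label_set); x: unseen feature vector."""
    neighbours = sorted(train, key=lambda s: distance(x, s[0]))[:k]

    def weight(xa):                 # assumed full-label distance weight
        return 1.0 / max(distance(x, xa), 1e-12)

    denom = 2.0 + sum(weight(xa) for xa, _ in neighbours)
    conf = {y: (1.0 + sum(weight(xa) for xa, ha in neighbours if y in ha)) / denom
            for y in label_space}
    h = {y for y, c in conf.items() if c > threshold}
    return h or {max(conf, key=conf.get)}   # fall back to most confident label
```

With unit weights this confidence reduces to (1 + n_j) / (2 + k), so a label is predicted only when it occurs among more than half of the k neighbors, which matches the even/odd-k behavior analyzed in the embodiment.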
As a hierarchical multi-label classification method, its prediction result must satisfy the hierarchy constraint, i.e., for all y'' ∈ h(x) with y'' < y', it holds that y' ∈ h(x). This is shown as follows. From the confidence calculation formula, if the algorithm predicts that unseen instance x has class label y_a (y_a ∈ Y), then the confidence c(x, y_a) that x belongs to class y_a is greater than the threshold t or is the maximum over all classes. Consider an ancestor class y_b of y_a (y_b ∈ Y, y_a < y_b). If y_b corresponds to the virtual root node of the class hierarchy, then x having class label y_a clearly satisfies the hierarchy constraint. Otherwise, for any neighbor sample (x_i, y_i) ∈ N(x) of x, if y_a ∈ y_i then also y_b ∈ y_i (the converse does not necessarily hold); the label expansion of the training set guarantees this conclusion. Therefore, under both the full-label distance weight method and the entropy-label distance weight method,

    Σ_i w_ib ≥ Σ_i w_ia,

while in the denominator Σ_i max_{1≤r≤q} w_ir remains unchanged, so the confidence c(x, y_b) that x belongs to class y_b is not less than the confidence c(x, y_a) that x belongs to class y_a. Hence if c(x, y_a) > t, then necessarily c(x, y_b) > t, and the prediction result satisfies the hierarchy constraint.
Finally, hierarchical evaluation indexes are adopted as the performance evaluation indexes of the learning method. The hierarchical precision (hP), hierarchical recall (hR) and hierarchical F metric (hF) are defined as

    hP = Σ_i |P̂_i ∩ T̂_i| / Σ_i |P̂_i|,
    hR = Σ_i |P̂_i ∩ T̂_i| / Σ_i |T̂_i|,
    hF = 2 · hP · hR / (hP + hR),

where P̂_i is the set consisting of the predicted classes of test sample i together with their ancestor classes, T̂_i is the set consisting of the classes to which test sample i actually belongs together with their ancestor classes, and the summations run over all test samples.
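A sketch of these hierarchical metrics over the ancestor-augmented prediction and truth sets (names are assumptions):

```python
def hierarchical_metrics(predicted, actual):
    """predicted, actual: per-sample sets already closed under ancestors."""
    inter = sum(len(p & t) for p, t in zip(predicted, actual))
    hp = inter / (sum(len(p) for p in predicted) or 1)
    hr = inter / (sum(len(t) for t in actual) or 1)
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf
```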
In order to make the identification of case-applicable law more practical, the target category predicted by the algorithm should preferably be a specific legal provision, not just a broad law, so the method considers the prediction performance for both cases: all legal provisions as target categories and only the specific legal provisions as target categories. Hereinafter, the hierarchical precision, recall and F metric when the target categories are all legal provisions are denoted by hP_all, hR_all and hF_all respectively, and those when the target categories are the specific legal provisions by hP_partial, hR_partial and hF_partial.
Besides the hierarchical evaluation indexes, the precision, recall and F metric of each category can also be calculated separately, and the averages of the precision, recall and F metric over all categories used as evaluation indexes of system performance, i.e., the macro-averages (Macro-averaging) of precision, recall and F metric. For each category, let TP denote the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives; the macro-averages Macro-P, Macro-R and Macro-F of precision, recall and F value are computed as

    Macro-P = (1/q) Σ_{v=1}^{q} P_v,   P_v = TP_v / (TP_v + FP_v),
    Macro-R = (1/q) Σ_{v=1}^{q} R_v,   R_v = TP_v / (TP_v + FN_v),
    Macro-F = (1/q) Σ_{v=1}^{q} 2 P_v R_v / (P_v + R_v).
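A sketch of the macro-averages, assuming the per-class F values are averaged (one common convention; the text does not fix this detail, and all names are assumptions):

```python
def macro_average(counts):
    """counts[y] = (TP, FP, FN) per class; TN is not needed for P/R/F."""
    ps, rs, fs = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        ps.append(p)
        rs.append(r)
        fs.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(counts)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```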
the invention relates to a global hierarchical multi-label classification method, which considers the hierarchical structure of class labels on the whole and ensures that the prediction result also meets the hierarchical limitation. The learning method is an inertia learning algorithm, a clear prediction model is not required to be constructed on a training set, and only the original multi-label sample is subjected to label expansion and then stored, so that incremental learning is supported; in the prediction stage, k adjacent samples of the unseen examples in the training set are firstly found, the confidence coefficient of the examples belonging to each class is determined according to the classification weight of the adjacent samples to each class, and then the class of the unseen examples is predicted. The learning method is simple in model, supports incremental learning, and can be well applied to automatic identification of the problem of multi-level multi-label classification which contains massive data and continuously increases data in case-applicable law.
Beneficial effects: the hierarchical multi-label classification method suitable for legal identification provided by the invention fully considers, as a whole, the tree-shaped hierarchical structure of the legal provision label space, so that the prediction result satisfies the hierarchy constraint without additional correction. Meanwhile, the method has a simple model, supports incremental learning, and is well suited to the automatic identification of case-applicable law, a hierarchical multi-label classification problem involving massive and continuously growing data.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a main flow chart of the present invention.
FIG. 2 is a sample referee document.
FIG. 3 is the tree structure of the legal provision label space.
FIG. 4 is the frequency distribution of legal provision combinations.
FIG. 5 compares the performance on the hierarchical indexes under different neighbor numbers.
FIG. 6 compares the performance on the macro-average indexes under different neighbor numbers.
FIG. 7 compares the performance on the indexes under different weighting strategies.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
The invention discloses a hierarchical multi-label classification method suitable for legal identification, which comprises the following steps:
Step 1: the required referee document original text data set is crawled from the Internet with a Jsoup-based crawler and randomly divided into a training set and a testing set in a 7:3 ratio. The referee documents are then preprocessed, mainly completing the following work:
extracting case facts and applicable legal provisions thereof from the case facts according to a literary structure of a referee document, wherein the case facts are used for generating feature vectors of case samples, and the applicable legal provisions are used for representing category labels of the case samples, and converting an original text data set into a semi-structured multi-label training set and a semi-structured testing set;
correcting errors and format inconsistency in case-applicable legal provisions;
and performing word segmentation and part-of-speech tagging on the case fact descriptions using the Language Technology Platform (LTP) of Harbin Institute of Technology.
Step 2: because legal provisions are organized in the legal system in a tree structure, the label space formed by the category labels in the multi-label training set is correspondingly a tree structure. Based on the hierarchical structure of the label space, the legal provisions corresponding to the case facts of all samples are expanded, so that the category label set corresponding to each case fact is a subset of the label space and satisfies the hierarchy constraint;

Step 3: feature selection is performed on the word segmentation results obtained from the original training set in step 1, and feature words that can sufficiently represent the case facts are selected to construct feature vectors; a structured extended multi-label training set Tr and test set Te are obtained through text representation;

Step 4: the prediction model is constructed. The k-nearest-neighbor sample set N(x) of an unseen example x from the extended multi-label test set Te is found in the extended multi-label training set Tr, a weight is set for each neighbor sample, the confidence that the unseen example belongs to each category in the label space is calculated from the classification weights of the k neighbor samples for each category in the label space, and the class label set h(x) of the unseen example is predicted, with h(x) satisfying the hierarchy constraint. Finally, the hierarchical restriction in the predicted class set h(x) is removed according to the tree structure of the label space (i.e., the inverse process of label expansion) to obtain the specific applicable legal provisions of the unseen example.
The step 2 comprises the following steps:
Step 2-1: in the hierarchical multi-label classification problem, a d-dimensional instance space X ⊆ R^d is given, together with a label space Y = {y_1, y_2, …, y_q} containing q classes, where y_i denotes the i-th class. The class label space hierarchy can be represented by a pair (Y, <), where < denotes the partial order relation on class labels and can be understood as the "belongs to" relation: if there exist y_i, y_j ∈ Y with y_i < y_j, then class y_i belongs to class y_j, y_i is a descendant class of y_j, and y_j is an ancestor class of y_i. The partial order relation < is asymmetric, non-reflexive and transitive, and can be described by the following four properties:
e) the unique root node of the class label hierarchy is represented by a virtual class label R: for any y_i ∈ Y, y_i < R;
f) for any y_i, y_j ∈ Y, if y_i < y_j, then not y_j < y_i;
g) for any y_i ∈ Y, not y_i < y_i;
h) for any y_i, y_j, y_k ∈ Y, if y_i < y_j and y_j < y_k, then y_i < y_k.
A multi-label classification problem in which the organizational structure of the class labels satisfies the above four properties can be regarded as a hierarchical multi-label classification problem. From this formal definition, in a hierarchical class label space, all other class nodes (excluding the starting node) on the unique path traced from any class node back to the root node are ancestor class nodes of that class node. Thus, if a sample has a class label c_i, the sample implicitly has all ancestor class labels of c_i, which requires that the classifier's predicted class set h(x) for unseen instances also satisfy the hierarchy constraint, i.e., for all y'' ∈ h(x) with y'' < y', it holds that y' ∈ h(x).
Step 2-2: for any training sample (x_i, y_i), 1 ≤ i ≤ m, where m is the number of referee document samples obtained, x_i ∈ X is a d-dimensional feature vector and y_i ⊆ Y is the class label set corresponding to x_i. Let the expanded class label set be y_i'; then y_i' contains all class labels in y_i together with all of their ancestor class labels. Formally,

    y_i' = y_i ∪ { y' ∈ Y : y < y' for some y ∈ y_i }.
the label extension process explicitly expresses the hierarchical relationship of the category labels in the category labels of the sample: if a sample is marked as a certain category, then the ancestor categories of the categories are also explicitly assigned to the sample through label expansion; the category label of each sample can be viewed as a subtree of the label space tree, and the top level of each subtree is the root node. It can be seen that if there is yi,yjE is Y and Yi<yjFor example, in the k neighbor sample in the extended multi-label training set, there is a class label yiMust not be less than having a class label yjThe number of samples of (1). The label expansion is an important step for ensuring that the prediction result of the learning algorithm meets the level limit.
The step 3 comprises the following steps:
and 3-1, the purpose of feature selection is to reduce dimensions of features, and since a general text feature selection algorithm cannot directly process a multi-label data set, multi-label data needs to be converted into single-label data for processing. The conversion method comprises the following steps: for each multi-label sample (x, h), the number of label categories in the label category set h is represented by | h |, and is replaced by | h | new single-label samples (x, y)i)(1≤i≤|y|,yiE h), class y for each new sampleiThat is, a category label in the original multi-label sample category label set h, table 1 gives an example of converting a multi-label sample into a single-label sample according to the above strategy.
TABLE 1 Multi-tag sample conversion Process
Figure GDA0002300851710000131
Step 3-2: through the conversion process of step 3-1, each multi-label case sample is converted into single-label case samples. Feature selection is performed on the word segmentation results obtained from the original training set in step 1 using a general feature selection algorithm, and roughly 100 feature words with the highest discriminative power are selected to form the feature space. The case fact part of each case sample is represented with feature words from the feature space, and the attribute value corresponding to each feature word, i.e., the feature weight, is calculated with the commonly used TF-IDF algorithm. Regarding the case fact part of each sample as a segmented document, the case fact parts of all samples form a document set. The feature weight tf-idf_ij of the j-th dimension feature in the i-th document is defined as

    tf-idf_ij = tf_ij × idf_j = tf_ij × log(N / n_j) / (normalization factor),

where tf_ij denotes the frequency of feature word t_j in document d_i, idf_j denotes the inverse document frequency of feature word t_j in the document set, N denotes the total number of documents in the document set, and n_j denotes the document frequency of feature word t_j, i.e., the number of documents in the document set in which t_j occurs; the denominator is the normalization factor.
Step 3-3: feature selection is performed on the word segmentation results obtained from the original training set in step 1, and a certain number of discriminative feature words are selected to form the feature vectors. Commonly used text feature selection methods are mainly based on document frequency (DF), mutual information (MI), information gain (IG), and chi-square statistics (χ², CHI). Feature selection based on document frequency is too simple and often fails to select the feature words carrying the most classification information, while mutual information has the drawback of being easily influenced by the marginal probability of the feature words; therefore this hierarchical multi-label classification method selects the information gain or chi-square statistical algorithm for feature selection.
Step 3-3 comprises feature selection with the information gain algorithm. The information gain IG(t) of feature word t is defined as

    IG(t) = − Σ_{i=1}^{q} P_r(y_i) log P_r(y_i)
            + P_r(t) Σ_{i=1}^{q} P_r(y_i | t) log P_r(y_i | t)
            + P_r(t̄) Σ_{i=1}^{q} P_r(y_i | t̄) log P_r(y_i | t̄),

where P_r(y_i) denotes the probability that class y_i occurs, P_r(t) denotes the probability that feature t occurs, P_r(y_i | t) denotes the probability that class y_i occurs given that feature t occurs, P_r(t̄) denotes the probability that feature t does not occur, and P_r(y_i | t̄) denotes the probability that class y_i occurs given that feature t does not occur. The information gain is calculated for each feature word in the document set, and feature words whose information gain is below the set threshold are not included in the feature space.
Step 3-3 can instead use the chi-square statistical algorithm for feature selection on the case fact texts in the training set. The null hypothesis is that a feature word is unrelated to a class; the more the test value computed from the CHI distribution deviates from the threshold, the more confidently the null hypothesis can be rejected in favor of the alternative hypothesis, namely that the feature word is highly correlated with the class. Let A be the number of documents containing feature word t and belonging to category y, B the number of documents containing feature word t but not belonging to category y, C the number of documents not containing feature word t but belonging to category y, D the number of documents neither containing feature word t nor belonging to category y, and N the total number of documents. The chi-square statistic χ²(t, y) of feature word t and category y is defined as

    χ²(t, y) = N (AD − CB)² / [ (A + C)(B + D)(A + B)(C + D) ].

When feature word t and category y are independent, the chi-square statistic is 0. For one feature word, the chi-square statistic with respect to each category is calculated, and then the mean and the maximum are computed respectively:

    χ²_avg(t) = Σ_{i=1}^{q} P_r(y_i) χ²(t, y_i),
    χ²_max(t) = max_{i=1,…,q} χ²(t, y_i),

where P_r(y_i) denotes the probability that category y_i occurs. Considering both quantities, the most discriminative feature words are selected. The main advantage of the chi-square statistical feature selection algorithm over mutual information is that it is a normalized value and can therefore better compare different feature words within the same category.
In step 4, when finding the k nearest neighbors, the distance d(x, x_i) between an unseen example x and a sample (x_i, h_i) is measured by the reciprocal of the cosine similarity of their feature vectors. The cosine similarity cos(γ, λ) between the feature vector γ of the unseen example and the feature vector λ of a neighbor sample is calculated as

    cos(γ, λ) = Σ_{s=1}^{S} γ_s λ_s / ( sqrt(Σ_{s=1}^{S} γ_s²) · sqrt(Σ_{s=1}^{S} λ_s²) ),

where s denotes the index of a vector component, i.e., the position of the component in the vector, S denotes the vector dimension, γ_s denotes the s-th component of vector γ, and λ_s denotes the s-th component of vector λ.
In step 4, d(x, x_i) denotes the distance between example x and sample (x_i, h_i). The classification weight w_ij of a sample (x_i, h_i) ∈ N(x) for class y_j is calculated with the full-label distance weight method or the entropy-label distance weight method (both weight formulas are given as images in the original publication; the full-label variant is based on the inverse distance for each class label carried by the sample, and the entropy-label variant additionally down-weights samples carrying many class labels).

The confidence c(x, y_j) that an unseen example belongs to class y_j is calculated as

    c(x, y_j) = ( 1 + Σ_i w_ij ) / ( 2 + Σ_i max_{1≤r≤q} w_ir ),

and the predicted class label set is h(x) = { y_j ∈ Y : c(x, y_j) > 0.5 }.
0.5 is selected as the decision threshold; when the confidence of the unseen instance for every class is below the decision threshold, the class with the highest confidence is returned as the class of the unseen instance.
Examples
As shown in fig. 1, the steps of the present invention are:
Step one: the required referee document original text data set is crawled from the Internet with a Jsoup-based crawler and randomly divided into a training set and a testing set in a 7:3 ratio. The referee documents are then preprocessed, mainly completing the following work:
extracting case facts and applicable legal provisions thereof from the case facts according to a literary structure of a referee document, wherein the case facts are used for generating feature vectors of case samples, and the applicable legal provisions are used for representing category labels of the case samples, and converting an original text data set into a semi-structured multi-label training set and a semi-structured testing set;
correcting errors and format inconsistency in case-applicable legal provisions;
and performing word segmentation and part-of-speech tagging on the case fact descriptions using the Language Technology Platform (LTP) of Harbin Institute of Technology.
Step two: based on the hierarchical structure of the label space, the legal provisions corresponding to the case facts of all samples are expanded, so that the category label set corresponding to each case fact is a subset of the label space and satisfies the hierarchy constraint;
step three, performing feature selection on the word segmentation result obtained from the original training set in the step 1, and selecting feature words capable of sufficiently representing case facts to construct feature vectors; obtaining a structured extended multi-label training set Tr and a test set Te through text representation;
Step four: the prediction model is constructed. First, the k-nearest-neighbor sample set N(x) of an unseen example x from the extended multi-label test set Te is found in the extended multi-label training set Tr; a weight is set for each neighbor sample; the confidence that the unseen example belongs to each category in the label space is calculated from the classification weights of the k neighbor samples for each category in the label space; and the class label set h(x) of the unseen example is predicted, with h(x) satisfying the hierarchy constraint. Finally, the hierarchical restriction in the predicted class set h(x) is removed according to the tree structure of the label space (i.e., the inverse process of label expansion) to obtain the specific applicable legal provisions of the unseen example.
The implementation data are obtained from the referee documents of people's courts at all levels of Zhejiang Province, as published by the Zhejiang courts.
FIG. 2 is a sample referee document, in which the portion marked with a straight underline is the case fact part and the portion marked with a wavy underline is the applicable legal provisions of the case. Case facts and their legal provisions are extracted according to the literary structure of the referee document. The preprocessing work mainly consists of cleaning and correcting the case-applicable legal part.
FIG. 3 shows the tree structure of the legal provision label space. Based on this hierarchical structure, the legal provisions corresponding to each case fact are label-expanded.
FIG. 4 is the legal provision combination histogram. According to the citation frequency of each legal provision, 26 laws, such as the litigation law of the People's Republic of China, together with the 451 specific legal provisions contained in these laws, are selected as category labels to form the label space; that is, the dimension of the label space is 477. The category label set of each case sample is represented as a label vector, each dimension of which represents one category label in the label space, i.e., one complete legal provision. If a case is applicable to a certain legal provision, the label entries corresponding to that provision and to all legal provisions containing it are 1 in the label vector; otherwise they are 0. The label vector of each sample therefore corresponds to one legal provision combination, the frequency of occurrence of each combination is the number of corresponding case samples, and the frequencies of occurrence of the legal provision combinations reflect some properties of the case sample set. By counting the frequency of each combination, selecting the combinations with higher frequency, and arranging them in descending order, FIG. 4 is obtained. As can be seen from the figure, the occurrence frequencies of the legal provision combinations approximately follow a long-tail distribution: a few legal provision combinations occur extremely frequently, indicating that a large number of case samples are applicable to these combinations, while the occurrence frequencies of most other legal provision combinations are relatively balanced.
In step three, the information gain algorithm is selected for feature selection. By calculating the information gain of each feature word, it can be found that most words with higher information gain are verbs or nouns; Table 2 shows the proportion of verbs and nouns among the feature words with the highest information gain values. Thus nouns and verbs have higher discriminative power in the legal identification problem than words of other parts of speech. Moreover, words other than verbs and nouns can be removed from the text through part-of-speech tagging, reducing the number of words in the text and simplifying subsequent computation.
Table 2 Proportion of verbs and nouns among the feature words:

    Number of feature words | Proportion of verbs and nouns | Proportion of total information gain from verbs and nouns
    100                     | 88.0%                         | 87.9%
    200                     | 80.0%                         | 82.3%
    300                     | 81.0%                         | 82.5%
    400                     | 80.5%                         | 82.0%
    500                     | 76.8%                         | 79.7%
Table 3 Summary of the experimental training set and test set:

                   Number of samples | Average number of class labels per sample
    Training set   102608            | 7.6344
    Test set       44210             | 7.6397
FIG. 5 and FIG. 6 compare the performance on the hierarchical indexes and on the macro-average indexes for different numbers of neighbors.
As can be seen from FIG. 5, when the number of neighbors is even, the precision of the algorithm is higher and the recall is lower; when the number of neighbors is odd, the precision is lower and the recall is higher. This difference gradually shrinks as the number of neighbors increases. The phenomenon can be explained from the principle of the algorithm: the decision threshold is 0.5, and when the number of neighbors is even, because of the smoothing parameter, only class labels whose occurrence frequency exceeds k/2 are predicted as class labels of the unseen instance, and a class label whose occurrence frequency equals exactly k/2 is not assigned to the unseen instance. Therefore, when the number of neighbors is even, the condition for assigning each class label to the unseen instance is stricter, so the prediction precision of the algorithm is higher and the recall correspondingly lower. This effect weakens as the number of neighbors increases, so the difference becomes smaller. It can also be seen from the figure that when the target categories are all legal provisions, every prediction index of the algorithm is higher than when the target categories are the specific legal provisions. This is because broader legal categories contain more case samples, making the model's predictions in these categories more accurate. Overall, the comprehensive prediction performance of the algorithm is best when the number of neighbors k is 5.
From FIG. 6 it can be seen that as the number of neighbors increases, the macro-average precision, recall and F metric of the algorithm all decrease. The reason may be that as the number of neighbors increases, it becomes harder for classes with fewer samples to reach the decision threshold, which lowers the prediction performance of most classes and ultimately lowers the corresponding macro-average performance.
FIG. 7 shows the performance of the algorithm on each evaluation index when the number of neighbors is fixed at 5 and the sample weighting strategy is the full-label distance weight method or the entropy-label distance weight method. Overall, for both hierarchical indexes and macro-average indexes, the entropy-label distance weighting strategy achieves better precision, while the full-label distance weighting strategy achieves better recall and F metric. The entropy-label weighting strategy is biased towards samples with fewer class labels; in the expanded hierarchical multi-label samples, the more specific the class a sample belongs to, the more class labels it has and the smaller its classification weight under the entropy-label weighting strategy, so the prediction result is biased towards upper-level classes and the generalization error is larger. Although the algorithm's performance declines when the target categories are the specific legal provisions, it still attains a hierarchical precision close to 80% and a hierarchical recall above 65%, indicating that case-applicable law identification based on the present hierarchical multi-label classification algorithm is effective.
Considering both cases, where the target categories are all legal provisions or the specific legal provisions, the macro-average precision, recall and F metric of the algorithm when the target categories are all legal provisions are denoted by mP_all, mR_all and mF_all respectively, and those when the target categories are the specific legal provisions by mP_partial, mR_partial and mF_partial.
In this implementation, two common hierarchical multi-label classification algorithms, the local TreeBoost.MH algorithm and the global Clus-HMC algorithm, are selected and compared with the prediction performance of the present hierarchical multi-label classification algorithm. Table 5 gives the performance comparison on the hierarchical indexes, and Table 6 gives the prediction performance comparison on the macro-average indexes.
Table 5 comparison of hierarchical index performance of each algorithm:
table 6 macro-average performance comparison of algorithms:
The results prove that the proposed hierarchical multi-label classification algorithm can achieve better prediction performance than the existing methods. Combined with the fact that the Lazy-HMC algorithm supports incremental learning, an effective and practical automatic system for identifying the law applicable to a case can be constructed with the Lazy-HMC algorithm.
The present invention provides a hierarchical multi-label classification method suitable for legal identification, and there are many methods and ways to implement this technical scheme; the above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention. All components not specified in the present embodiment can be realized by the prior art.

Claims (2)

1. A hierarchical multi-label classification method suitable for legal identification is characterized by comprising the following steps:
step 1, acquiring an original text data set of referee documents, dividing it into a training sample set and a testing sample set, and preprocessing: extracting the case facts and the legal provisions applicable to them from the documents according to the textual structure of a referee document, wherein the case facts are used for generating the feature vectors of case samples, the case samples comprise training samples and testing samples, the applicable legal provisions are used for representing the class labels of the case samples, and the original text data set is converted into a semi-structured multi-label training sample set and a multi-label testing sample set; performing word segmentation and part-of-speech tagging on the case fact descriptions;
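As an illustration of the word segmentation and part-of-speech tagging in step 1, the following minimal sketch uses the open-source jieba segmenter (one possible tool; the claim names no specific segmenter), with an invented sample sentence and an assumed content-word filter:

    # Sketch of step 1's segmentation and part-of-speech tagging using jieba.
    # The sample sentence and the noun/verb filter are illustrative assumptions.
    import jieba.posseg as pseg

    fact_text = "被告人张某于2016年3月盗窃财物价值人民币五千元"  # invented example
    tokens = [(pair.word, pair.flag) for pair in pseg.cut(fact_text)]
    # keep nouns (flags starting with 'n') and verbs ('v') as candidate feature words
    candidates = [word for word, flag in tokens if flag.startswith(("n", "v"))]
    print(tokens)
    print(candidates)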
step 2, expanding the legal provisions corresponding to the case facts of all case samples based on the hierarchical structure of the label space formed by the category labels in the multi-label training sample set, so that the category label set corresponding to each case fact is a subset of the label space and satisfies the hierarchical constraint;
step 3, performing feature selection on the word segmentation results of step 1, and selecting feature words capable of representing the case facts to construct feature vectors; obtaining, through text representation, a structured extended multi-label training sample set Tr and an extended multi-label testing sample set Te;
step 4, constructing a prediction model: finding in the extended multi-label training sample set Tr the k-nearest-neighbor sample set N′(x) of an unseen instance x from the extended multi-label testing sample set Te, wherein the unseen instance is the case fact to be classified; setting a weight for each neighbor sample; calculating the confidence that the unseen instance x belongs to each class according to the classification weights of the k neighbor samples for each class; predicting the class label set h(x) of the unseen instance x, wherein h(x) satisfies the hierarchical constraint; and finally removing, according to the tree structure of the label space, the ancestor labels introduced by the hierarchical constraint from the class label set h(x) of the unseen instance x, so as to obtain the specific legal provisions applicable to the unseen instance;
in step 1, the original text data set of referee documents is randomly divided into the training sample set and the testing sample set in a ratio of 7:3;
the step 2 comprises the following steps:
step 2-1, in the hierarchical multi-label classification problem, given a d-dimensional instance space X ⊆ R^d, where R is the set of real numbers, and a label space Y = {y_1, y_2, …, y_q} containing q classes, where y_v denotes the label of the v-th class and 1 ≤ v ≤ q, the hierarchy of the class label space is represented by the binary tuple (Y, ≺), where ≺ denotes the partial order relationship between category labels; if there exist y_v, y_u ∈ Y with y_v ≺ y_u, then the category label y_v belongs to the category label y_u, y_v is a descendant class label of y_u, and y_u is an ancestor class label of y_v; the classifier must satisfy the hierarchical constraint on the class label set h(x) predicted for the unseen instance x, namely:

∀ y″ ∈ h(x), if y″ ≺ y′ then y′ ∈ h(x)

where y″ is a class label in h(x) and y′ is an ancestor class label of y″;
step 2-2, for each multi-labeled case sample (x_i, h_i), 1 ≤ i ≤ m, where m is the number of all the obtained referee document samples, x_i ∈ X is a d-dimensional feature vector representing the case fact part, and h_i ⊆ Y is the class label set corresponding to x_i, namely the legal provisions corresponding to x_i; the expanded category label set is h_i′, with h_i ⊆ h_i′ ⊆ Y, and h_i′ includes all class labels in h_i together with all of their ancestor class labels:

h_i′ = h_i ∪ { y′ ∈ Y | y ≺ y′, y ∈ h_i }

where y′ is an ancestor class label of a class label y in h_i, y ∈ h_i;
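To make the expansion of step 2-2 concrete, the following minimal sketch computes the ancestor closure of a label set, assuming the tree hierarchy is encoded as a child-to-parent map (the encoding and the label names are illustrative assumptions):

    # Sketch of the step 2-2 expansion: close a label set under the ancestor
    # relation. Assumes the tree hierarchy is stored as a child-to-parent map
    # (None at the root).
    def expand_labels(h_i, parent):
        expanded = set(h_i)
        for y in h_i:
            p = parent.get(y)
            while p is not None:      # walk up to the root, collecting ancestors
                expanded.add(p)
                p = parent.get(p)
        return expanded

    # Usage: a specific article implies its chapter and law labels (names invented).
    parent = {"law/ch5/art264": "law/ch5", "law/ch5": "law", "law": None}
    print(expand_labels({"law/ch5/art264"}, parent))  # {'law/ch5/art264', 'law/ch5', 'law'}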
The step 3 comprises the following steps:
step 3-1, converting the multi-label sample data into single-label sample data: for each multi-labeled case sample (x_i, h_i) (1 ≤ i ≤ m), |h_i| denotes the number of category labels in the class label set h_i of the multi-labeled case sample, and the sample is replaced by |h_i| single-label case samples (x_i′, y_i′) (1 ≤ i′ ≤ |h_i|, y_i′ ∈ h_i), where the class label y_i′ of each single-label sample is one category label from the set h_i; the multi-label case samples comprise the multi-label training samples and the multi-label testing samples;
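A minimal sketch of this conversion, assuming each sample is stored as a (features, label set) pair:

    # Sketch of step 3-1: each multi-label sample (x_i, h_i) becomes |h_i|
    # single-label samples. The (features, label_set) layout is an assumption.
    def to_single_label(samples):
        return [(x, y) for x, h in samples for y in h]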
step 3-2, through the conversion of step 3-1, each multi-label case sample is converted into a plurality of single-label case samples; the case fact part of each single-label case sample is regarded as a document whose words have already been segmented, the case fact parts of all single-label case samples form a document set, and the feature weight tf-idf_i″j″ of the j″-th dimension feature in the i″-th document of the document set is defined as:

tf-idf_i″j″ = ( tf_i″j″ · log(N / n_j″) ) / sqrt( Σ_{j=1}^{d} ( tf_i″j · log(N / n_j) )² )

where tf_i″j″ denotes the frequency with which the feature word t_j″ occurs in the document d_i″, idf_j″ = log(N / n_j″) denotes the inverse document frequency of the feature word t_j″ in the document set, N denotes the total number of documents in the document set, n_j″ denotes the document frequency of the feature word t_j″ in the document set, i.e., the number of documents in which the feature word t_j″ occurs, and the denominator is a normalization factor;
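The tf-idf weighting above can be sketched as follows (a plain-Python illustration assuming documents are given as token lists and the vocabulary as an ordered list of feature words):

    # Sketch of the normalized tf-idf weighting above.
    import math
    from collections import Counter

    def tfidf_matrix(docs, vocab):
        N = len(docs)
        df = {t: sum(1 for d in docs if t in d) for t in vocab}  # document frequency n_j
        rows = []
        for d in docs:
            tf = Counter(d)
            raw = [tf[t] * math.log(N / df[t]) if df[t] else 0.0 for t in vocab]
            norm = math.sqrt(sum(w * w for w in raw)) or 1.0     # normalization factor
            rows.append([w / norm for w in raw])
        return rows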
step 3-3, performing feature selection on the word segmentation results of step 1 by using an information gain algorithm or a chi-square statistical algorithm, and selecting a certain number of feature words with distinguishing capability to form the feature space;
Feature selection using the information gain algorithm: the information gain IG(t) of a feature word t is defined as follows:

IG(t) = −Σ_{v=1}^{q} p(y_v) · log p(y_v) + p(t) · Σ_{v=1}^{q} p(y_v | t) · log p(y_v | t) + p(t̄) · Σ_{v=1}^{q} p(y_v | t̄) · log p(y_v | t̄)

where p(y_v) denotes the probability that the category label y_v occurs, p(t) denotes the probability that the feature word t occurs, p(y_v | t) denotes the probability that the category label y_v occurs given that the feature word t occurs, p(t̄) denotes the probability that the feature word t does not occur, and p(y_v | t̄) denotes the probability that the category label y_v occurs given that the feature word t does not occur; the information gain of each feature word in the document set is calculated, and feature words whose information gain is lower than a set threshold are not included in the feature space;
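A minimal sketch of the information gain computation, assuming the single-label documents are given as token sets paired with one class label each:

    # Sketch of the information gain IG(t) above. The document/label layout
    # is an assumption.
    import math

    def info_gain(docs, labels, t, classes):
        N = len(docs)
        has_t = [t in d for d in docs]
        p_t = sum(has_t) / N

        def sum_p_log_p(subset):  # sum over classes of p(y|subset) * log p(y|subset)
            total = len(subset)
            s = 0.0
            for y in classes:
                p = (sum(1 for lab in subset if lab == y) / total) if total else 0.0
                if p > 0:
                    s += p * math.log(p)
            return s

        prior = sum_p_log_p(labels)  # equals the sum of p(y_v) * log p(y_v)
        with_t = [lab for lab, h in zip(labels, has_t) if h]
        without_t = [lab for lab, h in zip(labels, has_t) if not h]
        return -prior + p_t * sum_p_log_p(with_t) + (1 - p_t) * sum_p_log_p(without_t)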
Feature selection using the chi-square statistical algorithm: let A be the number of documents that contain the feature word t and belong to the category labeled y_v, 1 ≤ v ≤ q, B the number of documents that contain the feature word t but do not belong to the category labeled y_v, C the number of documents that do not contain the feature word t but belong to the category labeled y_v, D the number of documents that neither contain the feature word t nor belong to the category labeled y_v, and N the total number of documents in the document set; then the chi-square statistic χ²(t, y_v) of the feature word t and the category label y_v is defined as:

χ²(t, y_v) = N · (A·D − B·C)² / ( (A + B) · (A + C) · (B + D) · (C + D) )

When the feature word t and the category y_v are independent, the chi-square statistic is 0; for a feature word, the chi-square statistic with respect to each category is calculated, and then the mean value χ²_avg(t) and the maximum value χ²_max(t) are calculated respectively:

χ²_avg(t) = Σ_{v=1}^{q} p(y_v) · χ²(t, y_v)

χ²_max(t) = max_{1 ≤ v ≤ q} χ²(t, y_v)

where p(y_v) denotes the probability that the category label y_v occurs; χ²_avg(t) and χ²_max(t) are comprehensively considered to select a certain number of feature words with distinguishing capability;
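The chi-square statistic for one feature word and one category can be sketched as follows, under the same assumed document/label layout:

    # Sketch of the chi-square statistic above for one feature word and one category.
    def chi_square(docs, labels, t, y_v):
        A = sum(1 for d, lab in zip(docs, labels) if t in d and lab == y_v)
        B = sum(1 for d, lab in zip(docs, labels) if t in d and lab != y_v)
        C = sum(1 for d, lab in zip(docs, labels) if t not in d and lab == y_v)
        D = sum(1 for d, lab in zip(docs, labels) if t not in d and lab != y_v)
        N = A + B + C + D
        denom = (A + B) * (A + C) * (B + D) * (C + D)
        return N * (A * D - B * C) ** 2 / denom if denom else 0.0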
in step 4, when the k nearest neighbors are found, the distance d(x, x_a) between the unseen instance x and a neighbor sample (x_a, h_a) in the extended multi-label training sample set, where (x_a, h_a) ∈ N′(x), 1 ≤ a ≤ k, and h_a is the class label set corresponding to x_a, is calculated as the reciprocal of the cosine similarity of the feature vectors, where the cosine similarity cos(γ, λ) between the feature vector γ of the instance x and the feature vector λ of the neighbor sample is calculated by the following formula:

cos(γ, λ) = ( Σ_{s=1}^{S} γ_s · λ_s ) / ( sqrt( Σ_{s=1}^{S} γ_s² ) · sqrt( Σ_{s=1}^{S} λ_s² ) )

where s denotes the index of a component of the feature vector, i.e., the position of the component in the feature vector, S denotes the dimension of the feature vector, γ_s denotes the s-th component of the feature vector γ, and λ_s denotes the s-th component of the feature vector λ.
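A brute-force sketch of the distance computation and neighbor search, assuming the training set is given as a list of (feature vector, label set) pairs:

    # Sketch: distance as the reciprocal of cosine similarity, plus a
    # brute-force k-nearest-neighbor search over the training set.
    import math

    def cosine(g, l):
        dot = sum(a * b for a, b in zip(g, l))
        ng = math.sqrt(sum(a * a for a in g))
        nl = math.sqrt(sum(b * b for b in l))
        return dot / (ng * nl) if ng and nl else 0.0

    def k_nearest(x, train, k):
        scored = []
        for xa, ha in train:
            c = cosine(x, xa)
            d = 1.0 / c if c > 0 else float("inf")  # distance = reciprocal of similarity
            scored.append((d, xa, ha))
        scored.sort(key=lambda item: item[0])       # smallest distance first
        return scored[:k]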
2. The method of claim 1, wherein: in step 4, d(x, x_a) denotes the distance between the unseen instance x and the neighbor sample (x_a, h_a) in the extended multi-label training sample set, and a full label distance weight method or an entropy label distance weight method is adopted to calculate the classification weight w_aj of the class label y_j in h_a, 1 ≤ j ≤ q;
The full label distance weight method calculates w_aj as:
[the formula is rendered as an image in the source and is not recoverable from the extracted text]
The entropy label distance weight method calculates w_aj as:
[the formula is rendered as an image in the source and is not recoverable from the extracted text]
The confidence c(x, y_j) that the unseen instance x belongs to the class label y_j is calculated by the following formula:
[the formula is rendered as an image in the source and is not recoverable from the extracted text]
where w_ar denotes the classification weight of the r-th class label y_r of h_a;
The class label set h(x) predicted for the unseen instance x is:

h(x) = { y_j ∈ Y | c(x, y_j) ≥ 0.5, 1 ≤ j ≤ q }
0.5 is selected as the decision threshold; when the confidence of the unseen instance x for every class label is smaller than the decision threshold, the class label with the maximum confidence is returned as the class label of the unseen instance.
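The exact full label / entropy label weight formulas and the confidence formula appear in the source only as images and could not be recovered, so the following sketch uses assumed stand-ins (an inverse-distance vote per label, divided by |h_a| under the entropy strategy to mimic its stated bias toward samples with fewer labels) purely to illustrate the control flow of claim 2:

    # Sketch of the claim-2 prediction step. The weight and confidence formulas
    # below are ASSUMED stand-ins, not the patent's own (which were unrecoverable):
    #   full label strategy:    each label in h_a gets weight 1 / d(x, x_a)
    #   entropy label strategy: that weight is further divided by |h_a|
    def predict(neighbors, label_space, strategy="full", threshold=0.5):
        votes = {y: 0.0 for y in label_space}
        total = 0.0
        for d, _xa, ha in neighbors:   # neighbors as returned by k_nearest (assumed)
            w = 1.0 / d if d > 0 else 1e9
            if strategy == "entropy":
                w /= len(ha)           # assumed form of the entropy strategy
            for y in ha:
                if y in votes:
                    votes[y] += w
            total += w
        conf = {y: v / total for y, v in votes.items()} if total else votes
        h_x = {y for y, c in conf.items() if c >= threshold}
        if not h_x:                    # all below threshold: return the best label
            h_x = {max(conf, key=conf.get)}
        return h_x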
CN201710832304.8A 2017-09-15 2017-09-15 Hierarchical multi-label classification method suitable for legal identification Active CN107577785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710832304.8A CN107577785B (en) 2017-09-15 2017-09-15 Hierarchical multi-label classification method suitable for legal identification

Publications (2)

Publication Number Publication Date
CN107577785A CN107577785A (en) 2018-01-12
CN107577785B true CN107577785B (en) 2020-02-07

Family

ID=61035969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710832304.8A Active CN107577785B (en) 2017-09-15 2017-09-15 Hierarchical multi-label classification method suitable for legal identification

Country Status (1)

Country Link
CN (1) CN107577785B (en)

Families Citing this family (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304386A (en) * 2018-03-05 2018-07-20 上海思贤信息技术股份有限公司 A kind of logic-based rule infers the method and device of legal documents court verdict
CN108334500B (en) * 2018-03-05 2022-02-22 上海思贤信息技术股份有限公司 Referee document labeling method and device based on machine learning algorithm
CN110245907A (en) * 2018-03-09 2019-09-17 北京国双科技有限公司 The generation method and device of court's trial notes content
CN108664924B (en) * 2018-05-10 2022-07-08 东南大学 Multi-label object identification method based on convolutional neural network
CN108763361A (en) * 2018-05-17 2018-11-06 南京大学 A kind of multi-tag taxonomy model method based on topic model
CN110895703B (en) * 2018-09-12 2023-05-23 北京国双科技有限公司 Legal document case recognition method and device
CN110909157B (en) * 2018-09-18 2023-04-11 阿里巴巴集团控股有限公司 Text classification method and device, computing equipment and readable storage medium
CN111126053B (en) * 2018-10-31 2023-07-04 北京国双科技有限公司 Information processing method and related equipment
CN109543178B (en) * 2018-11-01 2023-02-28 银江技术股份有限公司 Method and system for constructing judicial text label system
CN109685158B (en) * 2019-01-08 2020-10-16 东北大学 Clustering result semantic feature extraction and visualization method based on strong item set
CN109919368B (en) * 2019-02-26 2020-11-17 西安交通大学 Law recommendation prediction system and method based on association graph
CN109961094B (en) * 2019-03-07 2021-04-30 北京达佳互联信息技术有限公司 Sample acquisition method and device, electronic equipment and readable storage medium
CN110046256A (en) * 2019-04-22 2019-07-23 成都四方伟业软件股份有限公司 The prediction technique and device of case differentiation result
CN110163849A (en) * 2019-04-28 2019-08-23 上海鹰瞳医疗科技有限公司 Training data processing method, disaggregated model training method and equipment
CN110245229B (en) * 2019-04-30 2023-03-28 中山大学 Deep learning theme emotion classification method based on data enhancement
CN110135592B (en) * 2019-05-16 2023-09-19 腾讯科技(深圳)有限公司 Classification effect determining method and device, intelligent terminal and storage medium
CN110287287B (en) * 2019-06-18 2021-11-23 北京百度网讯科技有限公司 Case prediction method and device and server
CN110347839B (en) * 2019-07-18 2021-07-16 湖南数定智能科技有限公司 Text classification method based on generative multi-task learning model
CN110633365A (en) * 2019-07-25 2019-12-31 北京国信利斯特科技有限公司 Word vector-based hierarchical multi-label text classification method and system
CN110442722B (en) * 2019-08-13 2022-05-13 北京金山数字娱乐科技有限公司 Method and device for training classification model and method and device for data classification
CN110543634B (en) * 2019-09-02 2021-03-02 北京邮电大学 Corpus data set processing method and device, electronic equipment and storage medium
CN110825879B (en) * 2019-09-18 2024-05-07 平安科技(深圳)有限公司 Decide a case result determination method, device, equipment and computer readable storage medium
CN110751188B (en) * 2019-09-26 2020-10-09 华南师范大学 User label prediction method, system and storage medium based on multi-label learning
CN110851596B (en) * 2019-10-11 2023-06-27 平安科技(深圳)有限公司 Text classification method, apparatus and computer readable storage medium
CN110968693A (en) * 2019-11-08 2020-04-07 华北电力大学 Multi-label text classification calculation method based on ensemble learning
CN110837735B (en) * 2019-11-17 2023-11-03 内蒙古中媒互动科技有限公司 Intelligent data analysis and identification method and system
US11379758B2 (en) 2019-12-06 2022-07-05 International Business Machines Corporation Automatic multilabel classification using machine learning
CN111143569B (en) * 2019-12-31 2023-05-02 腾讯科技(深圳)有限公司 Data processing method, device and computer readable storage medium
CN110781650B (en) * 2020-01-02 2020-04-14 四川大学 Method and system for automatically generating referee document based on deep learning
CN111540468B (en) * 2020-04-21 2023-05-16 重庆大学 ICD automatic coding method and system for visualizing diagnostic reasons
CN111738303B (en) * 2020-05-28 2023-05-23 华南理工大学 Long-tail distribution image recognition method based on hierarchical learning
CN111723208B (en) * 2020-06-28 2023-04-18 西南财经大学 Conditional classification tree-based legal decision document multi-classification method and device and terminal
CN111930944B (en) * 2020-08-12 2023-08-22 中国银行股份有限公司 File label classification method and device
CN112464973B (en) * 2020-08-13 2024-02-02 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112016430B (en) * 2020-08-24 2022-10-11 郑州轻工业大学 Hierarchical action identification method for multi-mobile-phone wearing positions
CN111737479B (en) * 2020-08-28 2020-11-17 深圳追一科技有限公司 Data acquisition method and device, electronic equipment and storage medium
CN112182213B (en) * 2020-09-27 2022-07-05 中润普达(十堰)大数据中心有限公司 Modeling method based on abnormal lacrimation feature cognition
CN112131884B (en) * 2020-10-15 2024-03-15 腾讯科技(深圳)有限公司 Method and device for entity classification, method and device for entity presentation
CN112232524B (en) * 2020-12-14 2021-06-29 北京沃东天骏信息技术有限公司 Multi-label information identification method and device, electronic equipment and readable storage medium
CN113407727B (en) * 2021-03-22 2023-01-13 天津汇智星源信息技术有限公司 Qualitative measure and era recommendation method based on legal knowledge graph and related equipment
CN114117040A (en) * 2021-11-08 2022-03-01 重庆邮电大学 Text data multi-label classification method based on label specific features and relevance
CN114860892B (en) * 2022-07-06 2022-09-06 腾讯科技(深圳)有限公司 Hierarchical category prediction method, device, equipment and medium
CN117216688B (en) * 2023-11-07 2024-01-23 西南科技大学 Enterprise industry identification method and system based on hierarchical label tree and neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161198A1 (en) * 2013-12-05 2015-06-11 Sony Corporation Computer ecosystem with automatically curated content using searchable hierarchical tags

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199857A (en) * 2014-08-14 2014-12-10 西安交通大学 Tax document hierarchical classification method based on multi-tag classification
CN104881689A (en) * 2015-06-17 2015-09-02 苏州大学张家港工业技术研究院 Method and system for multi-label active learning classification
CN105868773A (en) * 2016-03-23 2016-08-17 华南理工大学 Hierarchical random forest based multi-tag classification method
CN106126972A (en) * 2016-06-21 2016-11-16 哈尔滨工业大学 A kind of level multi-tag sorting technique for protein function prediction

Also Published As

Publication number Publication date
CN107577785A (en) 2018-01-12

Similar Documents

Publication Publication Date Title
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN107798033B (en) Case text classification method in public security field
CN112256939B (en) Text entity relation extraction method for chemical field
CN108009135B (en) Method and device for generating document abstract
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
CN111832289A (en) Service discovery method based on clustering and Gaussian LDA
WO2020063071A1 (en) Sentence vector calculation method based on chi-square test, and text classification method and system
Joshi et al. Categorizing the document using multi class classification in data mining
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
Gao et al. A maximal figure-of-merit (MFoM)-learning approach to robust classifier design for text categorization
Ikram et al. Arabic text classification in the legal domain
CN116128544A (en) Active auditing method and system for electric power marketing abnormal business data
Alsaidi et al. English poems categorization using text mining and rough set theory
CN113590827B (en) Scientific research project text classification device and method based on multiple angles
Abdollahpour et al. Image classification using ontology based improved visual words
CN112270189B (en) Question type analysis node generation method, system and storage medium
Balaneshin-kordan et al. Sequential query expansion using concept graph
Hamdi et al. Machine learning vs deterministic rule-based system for document stream segmentation
Xiao et al. Revisiting table detection datasets for visually rich documents
Wang et al. A Method of Hot Topic Detection in Blogs Using N-gram Model.
Zhang et al. Extending associative classifier to detect helpful online reviews with uncertain classes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant