CN115952290B - Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning - Google Patents

Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning

Info

Publication number
CN115952290B (application CN202310218333.0A)
Authority
CN (China)
Prior art keywords
case; sample; labeling; labeled; unlabeled
Other versions
CN115952290A (Chinese)
Inventors
万玉晴, 吕灏, 苏超, 蒋东来
Assignee (original and current)
Taiji Computer Corp Ltd
Legal events
Application filed by Taiji Computer Corp Ltd, with priority to CN202310218333.0A; publication of application CN115952290A; application granted and publication of CN115952290B. Legal status: Active.

Abstract

The invention relates to a case feature labeling method, device and equipment based on active learning and semi-supervised learning, belonging to the technical field of judicial intelligence. The method uses an active learning strategy to select the partial samples with the greatest labeling benefit and submits them to legal experts for labeling, uses a semi-supervised learning strategy to select the partial samples with the highest confidence to expand the training set, and performs multi-level, multi-label case feature labeling after multiple iterations. By combining the advantages of active learning and semi-supervised learning, the method obtains a larger labeling benefit and more high-quality training samples while labeling less data, and models the hierarchical structure and semantic relations of case labels, thereby solving the problems of the excessively high cost of fully manual labeling and the long-tail effect in current case feature labeling, and improving labeling efficiency and accuracy.

Description

Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning
Technical Field
The invention relates to the technical field of judicial intelligence, and in particular to a case feature labeling method, device and equipment based on active learning and semi-supervised learning.
Background
With the continuous deepening of court informatization and smart court construction, courts have accumulated a large amount of digitized historical case file materials. These materials provide data support for the development of judicial intelligence applications, and also place higher demands on judicial intelligence technology. Case feature labeling is an important underlying technical tool for realizing intelligent judicial applications: legal expertise and judicial business knowledge are introduced into the case file corpus as prior knowledge through cognitive learning of the file text, and the resulting label information significantly promotes capabilities such as case semantic understanding, text semantic deconstruction, domain language model optimization and sample information enhancement; it can effectively improve the accuracy of computational models in related applications, and is of great significance to the practical deployment of numerous judicial case knowledge services such as case retrieval and document generation.
In the related art, for massive historical case data, a classification model is trained on manually labeled samples to realize case feature labeling. However, because there are many case types, the organized label system is huge, and all training samples need to be labeled by legal professionals. Therefore, the cost of obtaining labeled samples is high, and it is difficult to meet the demand of deep learning model training for labeled samples.
Therefore, how to quickly label case features while reducing the sample labeling cost has become a technical problem to be solved in the prior art.
Disclosure of Invention
In view of the above, the present invention aims to provide a case feature labeling method, device and equipment based on active learning and semi-supervised learning, so as to solve the problems that the cost of obtaining labeled samples is high and it is difficult to meet the demand of deep learning model training for labeled samples.
In order to achieve the above purpose, the invention adopts the following technical scheme:
In one aspect, a case feature labeling method based on active learning and semi-supervised learning comprises the following steps:
acquiring a case fact text to be labeled;
performing word segmentation processing on the case fact text to be labeled, and taking all obtained segmented words as the target case fact;
inputting the target case fact into a case feature labeling model to obtain a case feature labeling result of the target case fact; the case feature labeling result comprises: case features and the hierarchical relationship of the case features;
wherein the case feature labeling model is obtained through active learning and semi-supervised learning pre-training on a portion of labeled sample case facts and a portion of unlabeled sample case facts, and the case features of the labeled sample case facts are labeled with a hierarchical relationship.
Optionally, the method further comprises:
acquiring the case fact text of each judgment document in a judgment document set, performing word segmentation processing on each case fact text, and taking all obtained segmented words as a sample set;
extracting samples to be labeled from the sample set to construct a set of samples to be labeled, and constructing the remaining unlabeled samples into an unlabeled sample set; in response to labeling instructions from experts, labeling the case features of each sample to be labeled and the hierarchical relationship of those case features to obtain a labeled sample set; wherein each case feature corresponds to one label;
performing model training based on the labeled sample set and a preset model to obtain a preliminary case feature labeling model;
inputting the unlabeled samples in the unlabeled sample set into the preliminary case feature labeling model to obtain the labeling result of each unlabeled sample, the labeling result comprising: a classification probability for each label;
calculating the confidence corresponding to each unlabeled sample according to the classification probabilities;
according to the confidences, extracting the unlabeled samples with low confidence from the unlabeled sample set to serve again as the set of samples to be labeled, and, in response to labeling instructions from experts, labeling the case features and their hierarchical relationship for each sample to be labeled to obtain a labeled sample set; extracting the unlabeled samples with high confidence from the unlabeled samples, taking those samples together with their corresponding labeling results as labeled samples, and adding them to the labeled sample set; and iteratively updating the preliminary case feature labeling model until the number of iterations of the preliminary case feature labeling model reaches an iteration threshold, thereby obtaining the case feature labeling model.
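The iterative procedure above can be summarized as a short sketch. The following is a minimal Python sketch of the loop, assuming hypothetical helpers train_model, predict_proba, expert_label and confidence, none of which are named by the patent:

```python
# Minimal sketch of the iterative active + semi-supervised training procedure.
# train_model, predict_proba, expert_label and confidence are hypothetical
# helpers; the patent does not prescribe their interfaces.
def build_case_feature_model(samples, init_size, k_expert, k_pseudo, max_iters):
    to_label = samples[:init_size]                  # initial set of samples to be labeled
    unlabeled = samples[init_size:]                 # unlabeled sample set U
    labeled = expert_label(to_label)                # experts label features + hierarchy
    model = train_model(labeled)                    # preliminary labeling model
    for _ in range(max_iters):                      # until the iteration threshold
        probs = [predict_proba(model, x) for x in unlabeled]
        confs = [confidence(p) for p in probs]
        order = sorted(range(len(unlabeled)), key=confs.__getitem__)
        low = order[:k_expert]                      # lowest confidence -> expert labeling
        high = order[-k_pseudo:]                    # highest confidence -> pseudo-labels
        labeled += expert_label([unlabeled[i] for i in low])
        labeled += [(unlabeled[i], probs[i]) for i in high]
        keep = set(range(len(unlabeled))) - set(low) - set(high)
        unlabeled = [unlabeled[i] for i in sorted(keep)]
        model = train_model(labeled)                # retrain on the expanded set
    return model
```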
Optionally, extracting the unlabeled samples with low confidence from the unlabeled sample set to serve again as the set of samples to be labeled comprises:
sorting the unlabeled samples in descending order of confidence;
and extracting the unlabeled samples ranked at the end, up to a first threshold number, and re-constructing the set of samples to be labeled.
Optionally, extracting the unlabeled samples with high confidence from the unlabeled samples, taking those samples together with their corresponding labeling results as labeled samples, and adding them to the labeled sample set comprises:
sorting the unlabeled samples in descending order of confidence;
and extracting the unlabeled samples ranked at the front, up to a second threshold number, together with their corresponding labeling results as labeled samples, and adding them to the labeled sample set.
Optionally, a labeled sample comprises: a case fact sample and corresponding feature labels;
performing model training based on the labeled sample set and a preset model comprises the following steps:
acquiring the word vectors of the case fact sample, inputting the word vectors into a bidirectional long short-term memory network to obtain the association relationships within the case fact sample, inputting those association relationships into a preset convolutional neural network to obtain the respective features, and inputting the respective features into a fully connected layer to obtain the semantic vector of each case fact; and,
acquiring the label vectors of the feature labels, inputting the label vectors and the hierarchical relationship of the feature labels into a bidirectional tree-structured long short-term memory network, and inputting the output result into a fully connected layer to obtain the semantic vector of each label;
passing the semantic vector of each case fact and the semantic vector of each label through an attention network to obtain the weight of each case fact with respect to the different labels;
obtaining the labeling result of the case facts according to the weights of the different labels corresponding to each case fact; the labeling result of the case facts comprises the probability of each label corresponding to each case fact and the hierarchical relationship of the labels.
Optionally, the preset model comprises: a case fact encoder, a label structure encoder, and an attention network; the case fact encoder comprises a bidirectional long short-term memory network and a convolutional neural network connected to each other; the label structure encoder comprises a bidirectional tree-structured long short-term memory network.
Optionally, performing word segmentation processing on the case fact text to be labeled comprises: performing word segmentation processing on the case fact text to be labeled with the LTP tool or the jieba word segmentation tool.
Optionally, calculating the confidence corresponding to each unlabeled sample according to the classification probabilities comprises:
inputting the classification probabilities into a confidence calculation formula to obtain the corresponding confidence; the confidence calculation formula is:

C(x) = (1/n) Σ_{j=1..n} max( P(y_j | x), 1 − P(y_j | x) )

where y_j is the j-th label, P(y_j | x) is the probability, predicted by the model, that sample x has label y_j, and n is the number of labels.
In yet another aspect, a case feature labeling device based on active learning and semi-supervised learning comprises:
an acquiring module, configured to acquire a case fact text to be labeled;
a word segmentation module, configured to perform word segmentation processing on the case fact text to be labeled and take all obtained segmented words as the target case fact;
an input labeling module, configured to input the target case fact into a case feature labeling model to obtain a case feature labeling result of the target case fact; the case feature labeling result comprises: case features and the hierarchical relationship of the case features;
wherein the case feature labeling model is obtained through active learning and semi-supervised learning pre-training on a portion of labeled sample case facts and a portion of unlabeled sample case facts, and the case features of the labeled sample case facts are labeled with a hierarchical relationship.
In yet another aspect, a case feature labeling apparatus based on active learning and semi-supervised learning comprises a processor and a memory, the processor being connected to the memory:
the processor is configured to call and execute a program stored in the memory;
the memory is configured to store the program, the program being at least used for executing the above case feature labeling method based on active learning and semi-supervised learning.
The technical scheme provided by the embodiments of the invention has the following beneficial effects:
After the case fact text to be labeled is acquired, the target case fact is obtained through word segmentation; the target case fact is input into the case feature labeling model to obtain the case feature labeling result of the target case fact, the result comprising the case features and their hierarchical relationship; the case feature labeling model is obtained through active learning and semi-supervised learning pre-training on a portion of labeled sample case fact texts and a portion of unlabeled sample case fact texts, and the case features of the labeled sample case fact texts are labeled with a hierarchical relationship. In this method, an active learning strategy is used to select the partial samples with the greatest labeling benefit and submit them to legal experts for labeling, a semi-supervised learning strategy is used to select the partial samples with the highest confidence to expand the training set, and multi-level, multi-label case feature labeling is performed after multiple iterations. The method combines the advantages of active learning and semi-supervised learning: it obtains a larger labeling benefit and more high-quality training samples while labeling less data, and it models the hierarchical structure and semantic relations of case labels, thereby solving the technical problem of the excessively high cost of fully manual labeling in current case feature labeling, and improving labeling efficiency and accuracy.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained from these drawings by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic flow chart of a case feature labeling method based on active learning and semi-supervised learning according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for constructing a case feature labeling model according to an embodiment of the present invention;
FIG. 3 is a schematic architecture diagram of a case feature labeling model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a case feature labeling device based on active learning and semi-supervised learning according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a case feature labeling device based on active learning and semi-supervised learning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be described in detail below. Obviously, the described embodiments are only some, but not all, embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort shall fall within the protection scope of the present invention.
As described in the background, in the related art, for massive historical case data, a classification model is trained on manually labeled samples to realize case feature labeling. However, because there are many case types, the organized label system is huge, and all training samples need to be labeled by legal professionals. Therefore, the cost of obtaining labeled samples is high, and it is difficult to meet the demand of deep learning model training for labeled samples.
Therefore, how to quickly label case features while reducing the sample labeling cost has become a technical problem to be solved in the prior art.
Facing massive historical case data, an automatic or semi-automatic way of labeling case features is needed; the current approach, based on text classification technology, achieves case feature labeling by manually labeling samples and training a classification model. The key to such a technical scheme is the constructed label system and the quality and quantity of the labeled samples. In the prior art, case label systems lack consideration of precise hierarchical relationships, making it difficult to fully express the case features of complex cases. Accordingly, the present application constructs a multi-level label system according to the semantic associations and hierarchical relationships among the case labels of a specific case type, so as to model complex cases precisely.
In addition, even within the same case type, the distribution of case label samples exhibits an obvious long-tail effect: some labels appear in most cases while other labels rarely appear. This imbalanced label distribution greatly increases the difficulty of extracting labeled samples and causes serious data bias, making it hard for a trained model to converge to an optimal solution. The invention adopts active learning to alleviate these problems and reduce the manual labeling cost: instead of labeling all samples, the samples most beneficial to model training are iteratively selected for labeling, thereby obtaining the maximum labeling benefit at the minimum labeling cost.
Specifically, embodiments of the invention provide a case feature labeling method, device and equipment based on active learning and semi-supervised learning.
Fig. 1 is a schematic flow chart of a case feature labeling method based on active learning and semi-supervised learning according to an embodiment of the present invention. Referring to fig. 1, this embodiment may comprise the following steps:
Step S11, acquiring a case fact text to be labeled;
Step S12, performing word segmentation processing on the case fact text to be labeled, and taking all obtained segmented words as the target case fact;
Step S13, inputting the target case fact into a case feature labeling model to obtain a case feature labeling result of the target case fact; the case feature labeling result comprises: case features and the hierarchical relationship of the case features;
wherein the case feature labeling model is obtained through active learning and semi-supervised learning pre-training on a portion of labeled sample case facts and a portion of unlabeled sample case facts, and the case features of the labeled sample case facts are labeled with a hierarchical relationship.
It should be noted that, in the technical solution provided in this embodiment, the execution body of the method may be any controller having data and instruction processing functions; for example, the controller may be a PLC, a single-chip microcomputer, or the like. The controller may be provided in any electronic device; for example, the electronic device may be an intelligent terminal, a telephone watch, a calculator, a server, etc.
The server may be an electronic device with a certain arithmetic processing capability, having a network communication module, a processor, memory, and the like. Of course, the server may also refer to software running in the electronic device. The server may also be a distributed server, i.e. a system with multiple processors, memories and network communication modules operating in concert, or a server cluster formed of several servers. Alternatively, with the development of science and technology, the server may be a new technical means capable of realizing the corresponding functions of the embodiments of this specification, for example a new form of server based on quantum computing.
In a specific case feature labeling process based on active learning and semi-supervised learning, the case fact portion can be extracted from any judgment document requiring case feature labeling and used as the case fact text to be labeled. The judgment document may be obtained from the China Judgments Online website, or from any storage address; it may also be collected automatically with a crawler module implemented with the Python requests and BeautifulSoup libraries. After the judgment document is obtained, the case fact text can be extracted from it through character recognition.
After the case fact text to be labeled is obtained, word segmentation processing is performed on it, and all obtained segmented words are taken as the target case fact. The word segmentation may be performed with tools such as the Language Technology Platform (LTP) or the jieba word segmenter.
The target case fact is input into the pre-constructed case feature labeling model for feature labeling, thereby obtaining the case feature labeling result, which comprises the case features and the hierarchical relationship of the case features. Each case feature is one label, and the hierarchical relationship of the case features is the hierarchical relationship of the labels.
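As an illustration of this inference flow, a minimal Python sketch follows, assuming jieba is used for segmentation (LTP is the stated alternative) and a hypothetical case_feature_model object exposing a predict method:

```python
import jieba  # jieba segmentation; the LTP toolkit is an equally valid choice

def label_case_features(case_fact_text, case_feature_model):
    # Step S12: segment the case fact text; all tokens form the target case fact.
    target_case_fact = list(jieba.cut(case_fact_text))
    # Step S13: the pre-trained model returns per-label probabilities plus the
    # label hierarchy (case_feature_model is a hypothetical interface).
    return case_feature_model.predict(target_case_fact)
```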
It is worth noting that the case feature labeling model in this embodiment is obtained through active learning and semi-supervised learning pre-training on a portion of labeled sample case facts and a portion of unlabeled sample case facts, the case features of the labeled sample case facts being labeled with a hierarchical relationship.
It can be understood that, with the technical scheme of this embodiment, after the case fact text to be labeled is acquired, the target case fact is obtained through word segmentation; the target case fact is input into the case feature labeling model to obtain the case feature labeling result, which comprises the case features and their hierarchical relationship; and the model is pre-trained through active learning and semi-supervised learning on a portion of labeled and a portion of unlabeled sample case fact texts, the labeled sample case fact texts carrying a labeled hierarchical relationship of case features. In this method, an active learning strategy is used to select the partial samples with the greatest labeling benefit and submit them to legal experts for labeling, a semi-supervised learning strategy is used to select the partial samples with the highest confidence to expand the training set, and multi-level, multi-label case feature labeling is performed after multiple iterations. The method combines the advantages of active learning and semi-supervised learning: it obtains a larger labeling benefit and more high-quality training samples while labeling less data, and it models the hierarchical structure and semantic relations of case labels, thereby solving the technical problems of the excessively high cost of fully manual labeling and the long-tail effect in current case feature labeling, and improving labeling efficiency and accuracy.
In order to further explain the technical scheme of the invention, an embodiment of the invention also provides a construction process of the case feature labeling model.
Fig. 2 is a schematic flow chart of a method for constructing a case feature labeling model according to an embodiment of the present invention. Specifically, referring to fig. 2, the construction process of the case feature labeling model may include the following steps:
step S21, acquiring the case fact text of each referee document in the referee document set, performing word segmentation on each case fact text, and taking all obtained segmented words as a sample set.
For example, a crawler module may be implemented using a Python language-based requests library and beautfulso library to automatically collect different referee documents, thereby forming a referee document set. Because of the large number of judge documents, the judge documents can be batched according to the types of cases, and one or more types of documents can be collected and marked at a time. The case types can be classified into civil loan disputes, theft crimes and the like. In the following description, the present embodiment will take civil lending disputes as an example, and explain a case feature labeling method based on active learning; when the case facts are segmented, a word segmentation tool can select a Language Technology Platform (LTP). And taking the obtained segmentation as a sample set.
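A minimal sketch of such a crawler module follows; the URL list and the CSS selector for the case-fact section are illustrative assumptions, since the patent only names the requests and BeautifulSoup libraries:

```python
import requests
from bs4 import BeautifulSoup

def collect_case_facts(document_urls):
    """Collect the case-fact section of each judgment document.

    document_urls and the "case-fact" class name below are hypothetical;
    the patent only states that a requests/BeautifulSoup crawler is used.
    """
    facts = []
    for url in document_urls:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Assume the case-fact paragraph carries a dedicated class name.
        node = soup.find("div", class_="case-fact")
        if node is not None:
            facts.append(node.get_text(strip=True))
    return facts
```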
Step S22, extracting samples to be labeled from the sample set to construct a set of samples to be labeled, and constructing the remaining unlabeled samples into an unlabeled sample set; in response to labeling instructions from experts, labeling the case features of each sample to be labeled and the hierarchical relationship of those case features to obtain a labeled sample set; wherein each case feature corresponds to one label.
It should be noted that in this embodiment a case label system may be constructed in advance. For each case type, a set of case labels can be organized by legal experts; this label set contains all the case features of the case type, each label represents one case feature, the labels have a hierarchical relationship, and one case may correspond to several labels, i.e. the case has several features. All the case labels together form the case label system. Using the prior knowledge of legal experts improves the accuracy of labeling.
For example, taking the private lending case type, under the dispute-focus label the hierarchy of labels may be: the first-level label is litigation subject qualification, with second-level labels such as the lender being inconsistent with the actual payer, the borrower being inconsistent with the actual borrower, whether the borrower is a qualified plaintiff, and whether the borrower's spouse is a qualified defendant; or the first-level label is contract subject dispute, the second-level label is transfer of creditor's rights and debts, and the third-level labels are debt assumption, a third party joining the debt, and the like. Under the case-hierarchy label, the hierarchy may be: the first-level label is civil case, and the second-level labels may be personality right disputes; marriage, family and inheritance disputes; property right disputes; contract, negotiorum gestio and unjust enrichment disputes; and the like. Under contract, negotiorum gestio and unjust enrichment disputes, the second-level label may be loan contract disputes among the contract disputes, and the third-level label may be private lending disputes.
In a specific implementation process, a certain number of samples to be labeled (for example, 1000 to 3000) can be extracted from the sample set each time to serve as the set of samples to be labeled, and the remaining unlabeled samples are constructed into the unlabeled sample set. This can be expressed as:
extracting the samples to be labeled T = {T_1, T_2, T_3, …, T_n} from the sample set S, where n is the number of samples to be labeled extracted each time, i.e. the number of samples in the constructed set of samples to be labeled.
After the samples to be labeled are extracted, the case features and hierarchical relationships of the samples in the set are identified by legal experts and labeled manually. In a specific implementation, a sample to be labeled is sent to a legal expert, the legal expert sends a labeling instruction (carrying the labeling information), and the case features of the sample and their hierarchical relationship are labeled in response to that instruction. Each case feature corresponds to one label.
Step S23, performing model training based on the labeled sample set and a preset model to obtain a preliminary case feature labeling model.
Model training is performed on the labeled sample set with the preset model, thereby obtaining the preliminary case feature labeling model. It should be noted that the labeled sample set may be divided into a training set, a validation set and a test set for model training, validation and testing; the validation set and the test set should cover as many of the labels under the case type as possible.
In the training process of the preliminary case feature labeling model, in the first iteration the labeled samples are taken as the labeled sample set, denoted L. In the iterative training process of the model, if the iteration is not the first round, the currently labeled samples T are added to the training set and removed from the unlabeled sample set U.
Step S24, inputting the unlabeled samples in the unlabeled sample set into the preliminary case feature labeling model to obtain the labeling result of each unlabeled sample, the labeling result comprising: a classification probability for each label.
After the preliminary case feature labeling model is obtained through training, the case facts in the unlabeled sample set are predicted with the trained model, yielding the labeling result of each unlabeled sample, i.e. the classification probability of each label.
Step S25, calculating the confidence corresponding to each unlabeled sample according to the classification probabilities.
Specifically, calculating the confidence of each unlabeled sample according to the classification probabilities comprises: inputting the classification probabilities into a confidence calculation formula to obtain the corresponding confidence; the confidence calculation formula is:

C(x) = (1/n) Σ_{j=1..n} max( P(y_j | x), 1 − P(y_j | x) )

where y_j is the j-th label, P(y_j | x) is the probability, predicted by the model, that sample x has label y_j, and n is the number of labels.
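A minimal NumPy sketch of a confidence computation of this form follows; the exact aggregation is an assumption consistent with the variable definitions above:

```python
import numpy as np

def confidence(label_probs):
    """Confidence of one unlabeled sample.

    label_probs: array of n per-label classification probabilities P(y_j | x).
    Averages the per-label certainty max(P, 1 - P); this aggregation is an
    assumption consistent with the variable definitions, not a verbatim copy
    of the patent's formula.
    """
    p = np.asarray(label_probs, dtype=float)
    return float(np.mean(np.maximum(p, 1.0 - p)))
```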
Step S26, according to the confidences, extracting the unlabeled samples with low confidence from the unlabeled sample set to serve again as the set of samples to be labeled, and, in response to labeling instructions from experts, labeling the case features and their hierarchical relationship for each sample to be labeled to obtain a labeled sample set; extracting the unlabeled samples with high confidence from the unlabeled samples, taking those samples together with their corresponding labeling results as labeled samples, and adding them to the labeled sample set; and iteratively updating the preliminary case feature labeling model until the number of iterations reaches the iteration threshold, thereby obtaining the case feature labeling model.
After the confidence of each unlabeled sample is calculated, the unlabeled samples with low confidence can be extracted according to the confidence to serve again as the set of samples to be labeled, and legal experts label them manually; the samples with high confidence are taken directly as labeled samples and added to the labeled sample set, reducing the amount of manual labeling. The training loop is then executed, iteratively updating the preliminary case feature labeling model until its number of iterations reaches the iteration threshold, which may be preset, thereby obtaining the case feature labeling model.
It can be understood that, with the technical scheme provided by this embodiment, when the case feature labeling model is constructed, an active learning strategy is used to select the partial samples with the greatest labeling benefit and submit them to legal experts for labeling, a semi-supervised learning strategy is used to select the partial samples with the highest confidence to expand the training set, and multi-level, multi-label case feature labeling is performed after multiple iterations. The method combines the advantages of active learning and semi-supervised learning: it obtains a larger labeling benefit and more high-quality training samples while labeling less data, and it models the hierarchical structure and semantic relations of case labels, thereby solving the technical problems of the excessively high cost of fully manual labeling and the long-tail effect in current case feature labeling, and improving labeling efficiency and accuracy.
On the basis of the above embodiment, optionally, extracting the unlabeled samples with low confidence from the unlabeled sample set to serve again as the set of samples to be labeled comprises:
sorting the unlabeled samples in descending order of confidence;
and extracting the unlabeled samples ranked at the end, up to a first threshold number, and re-constructing the set of samples to be labeled.
For example, the first-threshold number of unlabeled samples ranked at the end may be extracted (i.e. the first-threshold number of unlabeled samples with the greatest uncertainty is selected) to re-construct the set of samples to be labeled. The first threshold may be a specific value K, such as 100.
This can be expressed as:

U_S = argmax^(K)_{x ∈ U} ( 1 − C(x) )

where U_S is the set of samples to be labeled selected this time from the unlabeled sample set, U is the current unlabeled sample set, and C(x) is the confidence of the model when identifying the labels of sample x. The meaning of the function argmax is that the argument x at which the function f(x) attains its maximum is x* = argmax_x f(x); here, the first K samples with the greatest model uncertainty 1 − C(x) are taken as the set of samples to be labeled.
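A minimal sketch of this top-K selection, assuming the per-sample confidences have already been computed as above:

```python
import numpy as np

def select_most_uncertain(unlabeled, confidences, k):
    # U_S: the K samples whose confidence C(x) is lowest, i.e. whose
    # uncertainty 1 - C(x) is greatest.
    idx = np.argsort(confidences)[:k]     # indices in ascending confidence
    return [unlabeled[i] for i in idx]
```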
On the basis of the above embodiment, optionally, extracting the unlabeled samples with high confidence from the unlabeled samples, taking those samples together with their corresponding labeling results as labeled samples, and adding them to the labeled sample set comprises:
sorting the unlabeled samples in descending order of confidence;
and extracting the unlabeled samples ranked at the front, up to a second threshold number, together with their corresponding labeling results as labeled samples, and adding them to the labeled sample set.
The second threshold may be a specific value, for example 100. The samples with the highest confidence, together with their labels, are selected and added to the labeled training set as labeled samples, and this portion of samples is removed from the unlabeled samples.
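Correspondingly, a minimal sketch of the high-confidence pseudo-labeling step, where predictions holds the labeling results predicted by the model:

```python
import numpy as np

def select_pseudo_labeled(unlabeled, predictions, confidences, k):
    # The K most confident samples, paired with their predicted labeling
    # results, become new labeled samples; the caller also removes them
    # from the unlabeled set.
    idx = np.argsort(confidences)[-k:]    # indices of the highest confidence
    return [(unlabeled[i], predictions[i]) for i in idx]
```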
On the basis of the above embodiment, optionally, a labeled sample comprises: a case fact sample and corresponding feature labels;
performing model training based on the labeled sample set and a preset model comprises the following steps:
acquiring the word vectors of the case fact sample, inputting the word vectors into a bidirectional long short-term memory network to obtain the association relationships within the case fact sample, inputting those association relationships into a preset convolutional neural network to obtain the respective features, and inputting the respective features into a fully connected layer to obtain the semantic vector of each case fact; and,
acquiring the label vectors of the feature labels, inputting the label vectors and the hierarchical relationship of the feature labels into a bidirectional tree-structured long short-term memory network, and inputting the output result into a fully connected layer to obtain the semantic vector of each label;
passing the semantic vector of each case fact and the semantic vector of each label through an attention network to obtain the weight of each case fact with respect to the different labels;
obtaining the labeling result of the case facts according to the weights of the different labels corresponding to each case fact; the labeling result of the case facts comprises the probability of each label corresponding to each case fact and the hierarchical relationship of the labels.
On the basis of the above embodiment, optionally, the preset model comprises: a case fact encoder, a label structure encoder, and an attention network; the case fact encoder comprises a bidirectional long short-term memory network and a convolutional neural network connected to each other; the label structure encoder comprises a bidirectional tree-structured long short-term memory network.
Specifically, fig. 3 is a schematic architecture diagram of a case feature labeling model according to an embodiment of the present invention. Referring to fig. 3, the architecture of the case feature labeling model may comprise a case fact encoder 31, a label structure encoder 32 and an attention network 33; the case fact encoder comprises a bidirectional long short-term memory network 311 and a convolutional neural network 312 connected to each other; the label structure encoder comprises a bidirectional tree-structured long short-term memory network 321.
Specifically, a labeled sample comprises a case fact sample and corresponding feature labels. A word vector table can be queried through the case fact encoder to obtain the word vectors of the segmented case fact text of each case, denoted E = {e_1, e_2, e_3, …, e_m}, where m is the number of segmented words of the case fact sample. The word vector table can be an existing one, such as the open-source Chinese word vectors released by Tencent AI Lab, or it can be obtained by training on the collected judgment document data set.
E is input into a 3-layer bidirectional long short-term memory network (Bi-LSTM). The final hidden state of each layer is formed by concatenating the hidden states of the forward and backward LSTMs, and can be calculated as:

h_t^(i) = [ h_fwd,t^(i) ; h_bwd,t^(i) ]
h_fwd,t^(i) = LSTM_fwd^(i)( h_t^(i−1), h_fwd,t−1^(i) )
h_bwd,t^(i) = LSTM_bwd^(i)( h_t^(i−1), h_bwd,t+1^(i) )

where h_t^(i) is the hidden state of the i-th layer network at time t, LSTM_fwd^(i) and LSTM_bwd^(i) denote the forward and backward LSTM units of the i-th layer respectively, and the input of the bottom LSTM layer is the word vector and the state at the previous time, i.e. h_t^(0) = e_t.
Then the highest-layer LSTM states are input into the convolutional neural network (CNN), and features are extracted by the CNN. In the CNN structure, the output states H of the bidirectional LSTM first pass through a convolution layer formed by convolution kernels; the j-th CNN unit extracts the feature:

c_j = f( K_j ∗ H + b_j )

where f is a nonlinear activation function, ∗ denotes the convolution operation, K_j is a convolution kernel, and b_j is a bias term.

Optionally, the pooling layer of the CNN structure may use the max-pooling technique. After a sliding window of size w and the max-pooling operation, the important feature with the maximum value is extracted from each feature; the j-th pooled feature is:

p_j = max( c_j, c_{j+1}, …, c_{j+w−1} )

where max is the maximum operation and m is the number of input units, which equals the number of input word vectors. For the final output of the CNN, each feature passes through a fully connected layer, i.e.

o_j = g( W p_j + b ),  O = [ o_1, o_2, o_3, …, o_m ]

where g is a nonlinear activation function, W is the weight of the fully connected layer, and b is the bias term.
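A minimal PyTorch sketch of the case fact encoder described above (3-layer Bi-LSTM, convolution, windowed max-pooling, fully connected layer); all dimensions, the kernel size and the window size w are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaseFactEncoder(nn.Module):
    """Sketch: 3-layer Bi-LSTM -> convolution -> windowed max-pooling -> FC."""

    def __init__(self, emb_dim=200, hidden=128, channels=128, kernel=3, w=3, out_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=3,
                              bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(2 * hidden, channels, kernel, padding=kernel // 2)
        self.w = w                                    # max-pooling window size
        self.fc = nn.Linear(channels, out_dim)

    def forward(self, word_vectors):                  # E: (batch, m, emb_dim)
        h, _ = self.bilstm(word_vectors)              # top-layer Bi-LSTM states
        c = torch.relu(self.conv(h.transpose(1, 2)))  # convolution features c_j
        p = F.max_pool1d(c, kernel_size=self.w, stride=1)   # sliding-window max-pooling
        return torch.tanh(self.fc(p.transpose(1, 2)))       # semantic vectors O = [o_1, ...]
```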
The label vectors of the feature labels can be obtained through the label structure encoder (for example, by querying the word vector table), and the case label hierarchy is encoded with a bidirectional tree-structured long short-term memory network (bidirectional Tree-LSTM). The whole label system is treated as a tree structure, each label being a node of the tree; each node receives information both from its child nodes (bottom-up) and from its parent node (top-down).
In the Tree-LSTM, the nodes are computed as follows:

i_j = σ( W^(i) v_j + U^(i) h̃_j + b^(i) )
f_jk = σ( W^(f) v_j + U^(f) h_k + b^(f) ),  k ∈ C(j)
o_j = σ( W^(o) v_j + U^(o) h̃_j + b^(o) )
u_j = tanh( W^(u) v_j + U^(u) h̃_j + b^(u) )
c_j = i_j ⊙ u_j + Σ_{k ∈ C(j)} f_jk ⊙ c_k
h_j = o_j ⊙ tanh( c_j )

where C(j) denotes the child nodes of node j, h_j and c_j denote the hidden state and memory cell state of node j, i_j, f_jk and o_j are the input gate, forget gate and output gate of each node, v_j is the label vector corresponding to node j, W and U are the weight parameters of each gating unit, b is the bias term, and σ is the sigmoid activation function.

The computation of h̃_j differs with the direction of information propagation. For bottom-up propagation:

h̃_j = Σ_{k ∈ C(j)} h_k

For top-down propagation:

h̃_j = P(k, j) · h_k

where k is the parent node of j and P(k, j) is the probability that node j occurs given parent node k, which can be statistically estimated from the samples:

P(k, j) = N_j / N_k

where N_k and N_j are the numbers of times node k and node j occur in the samples, respectively.

Finally, the bottom-up and top-down hidden states are concatenated to obtain the final state of node j:

h_j = [ h_j↑ ; h_j↓ ]
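A minimal PyTorch sketch of one Tree-LSTM node update follows, using the standard child-sum parameterization, which matches the gate variables listed above but is an assumption as to the exact form; the top-down pass reuses the same cell with h̃ replaced by P(k, j) times the parent's hidden state:

```python
import torch
import torch.nn as nn

class TreeLSTMCell(nn.Module):
    """One child-sum Tree-LSTM node update (bottom-up direction).

    The child-sum parameterization is an assumption; the patent's exact
    equations are not reproduced in the source text.
    """

    def __init__(self, label_dim, hidden):
        super().__init__()
        self.iou = nn.Linear(label_dim + hidden, 3 * hidden)  # input/output/update
        self.f = nn.Linear(label_dim + hidden, hidden)        # one forget gate per child

    def forward(self, v_j, child_h, child_c):
        # v_j: (label_dim,)  child_h, child_c: (num_children, hidden)
        h_tilde = child_h.sum(dim=0)                          # sum of child hidden states
        i, o, u = torch.chunk(self.iou(torch.cat([v_j, h_tilde])), 3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f(torch.cat(
            [v_j.expand(child_h.size(0), -1), child_h], dim=1)))  # f_jk per child
        c_j = i * u + (f * child_c).sum(dim=0)                # memory cell of node j
        h_j = o * torch.tanh(c_j)                             # hidden state of node j
        return h_j, c_j
```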
For the model output, the attention network computes the weight of the information that the encoding of each case fact contributes to each label; the case feature vectors are then computed from these weights and used to output the classification labels.

The information weight of the case encoding with respect to a label is computed as:

α_ij = exp( x_i · l_j ) / Σ_{i′} exp( x_{i′} · l_j )

where x_i is the encoding vector (the semantic vector of the case fact) of the i-th case fact output by the case fact encoder, and l_j is the encoding vector (the semantic vector of the label) of the j-th label output by the label structure encoder.

Then the case feature vector is computed from the information weights:

v_j = Σ_i α_ij · x_i

The output of the feature vector after the sigmoid activation function is:

ŷ_i = sigmoid( W_c · v_i + b_c )

where v_i is the feature vector corresponding to the i-th sample, W_c is the training parameter of the classifier, and b_c is the bias term of the classifier.
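A minimal PyTorch sketch of the attention network and sigmoid output layer; the softmax normalization over case facts is an assumption consistent with the weight definition above:

```python
import torch

def attention_classify(fact_vecs, label_vecs, w_c, b_c):
    # fact_vecs: (m, d) semantic vectors x_i from the case fact encoder.
    # label_vecs: (N, d) semantic vectors l_j from the label structure encoder.
    # w_c (d,) and b_c (scalar) are the classifier parameters.
    scores = fact_vecs @ label_vecs.t()          # x_i . l_j for every fact/label pair
    alpha = torch.softmax(scores, dim=0)         # weight of each fact for each label
    v = alpha.t() @ fact_vecs                    # (N, d) case feature vectors
    return torch.sigmoid(v @ w_c + b_c)          # (N,) per-label probabilities
```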
Finally, the model is optionally trained with a cross-entropy loss function, as follows:

Loss = −(1/M) Σ_{i=1..M} Σ_{j=1..N} [ y_ij · log(ŷ_ij) + (1 − y_ij) · log(1 − ŷ_ij) ]

where y_ij is the j-th label of the i-th sample, ŷ_ij is the corresponding sigmoid output for that sample, M is the number of samples, and N is the number of labels.
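In PyTorch this multi-label cross-entropy can be computed directly, as in the following minimal sketch with illustrative shapes M = 4 and N = 10:

```python
import torch
import torch.nn.functional as F

y_hat = torch.rand(4, 10)                  # illustrative sigmoid outputs (M=4, N=10)
y = torch.randint(0, 2, (4, 10)).float()   # illustrative 0/1 label matrix
loss = F.binary_cross_entropy(y_hat, y)    # mean over all samples and labels
```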
In the present application, the results of the two encoders, i.e. the case fact semantic vectors and the label semantic vectors, learn the feature weights of the labels through the attention network, and the labels are finally classified with the differently weighted features to obtain the probabilities of the case labels.
It can be understood that, with the technical scheme provided by this embodiment, when the case feature labeling model is constructed, an active learning strategy is used to select the partial samples with the greatest labeling benefit and submit them to legal experts for labeling, a semi-supervised learning strategy is used to select the partial samples with the highest confidence to expand the training set, and multi-level, multi-label case feature labeling is performed after multiple iterations. The method combines the advantages of active learning and semi-supervised learning: it obtains a larger labeling benefit and more high-quality training samples while labeling less data, and it models the hierarchical structure and semantic relations of case labels, thereby solving the technical problems of the excessively high cost of fully manual labeling and the long-tail effect in current case feature labeling, and improving labeling efficiency and accuracy.
Based on the same general inventive concept, an embodiment of the invention provides a case feature labeling device based on active learning and semi-supervised learning, which is used for realizing the above method embodiments.
Fig. 4 is a schematic structural diagram of a case feature labeling device based on active learning and semi-supervised learning according to an embodiment of the present invention. Referring to fig. 4, the device provided in this embodiment may comprise:
an acquiring module 41, configured to acquire a case fact text to be labeled;
a word segmentation module 42, configured to perform word segmentation processing on the case fact text to be labeled and take all obtained segmented words as the target case fact;
an input labeling module 43, configured to input the target case fact into the case feature labeling model to obtain a case feature labeling result of the target case fact; the case feature labeling result comprises: case features and the hierarchical relationship of the case features;
wherein the case feature labeling model is obtained through active learning and semi-supervised learning pre-training on a portion of labeled sample case facts and a portion of unlabeled sample case facts, and the case features of the labeled sample case facts are labeled with a hierarchical relationship.
Optionally, the method further comprises: the model construction module is used for acquiring the case fact text of each judge document in the judge document set, carrying out word segmentation on each case fact text, and taking all obtained segmented words as a sample set;
Extracting a sample to be marked from a sample set, constructing the sample set to be marked, and constructing the rest unlabeled sample into an unlabeled sample set; labeling case characteristics of each sample to be labeled in the sample set to be labeled and the hierarchical relationship of the case characteristics in response to the labeling instruction of the expert to obtain a labeled sample set; wherein each case feature corresponds to a tag;
model training is carried out based on the marked sample set and a preset model, and a preliminary case feature marking model is obtained;
inputting unlabeled samples in the unlabeled sample set into a preliminary case feature labeling model to obtain labeling results of each unlabeled sample, wherein the labeling results comprise: a classification probability for each tag;
calculating the confidence coefficient corresponding to each unlabeled sample according to the classification probability;
extracting unlabeled samples with low confidence in unlabeled sample sets to be used as sample sets to be labeled again according to the confidence, and labeling case features and the hierarchical relationship of the case features of each sample to be labeled in the sample sets to be labeled in response to labeling instructions of experts to obtain labeled sample sets; extracting unlabeled samples with high confidence from the unlabeled samples, taking the unlabeled samples with high confidence and corresponding labeling results as labeled samples, and adding the labeled samples into a labeled sample set; and carrying out iteration update on the preliminary case feature labeling model until the iteration times of the preliminary case feature labeling model reach the iteration times threshold value, so as to obtain the case feature labeling model.
Optionally, the model construction module is specifically configured to:
sort the unlabeled samples in descending order of confidence;
and extract the unlabeled samples ranked at the end, up to the first threshold number, and re-construct the set of samples to be labeled.
Optionally, the model construction module is specifically configured to:
sort the unlabeled samples in descending order of confidence;
and extract the unlabeled samples ranked at the front, up to the second threshold number, together with their corresponding labeling results as labeled samples, and add them to the labeled sample set.
Optionally, the model construction module is specifically configured to:
acquire the word vectors of the case fact sample, input the word vectors into a bidirectional long short-term memory network to obtain the association relationships within the case fact sample, input those association relationships into a preset convolutional neural network to obtain the respective features, and input the respective features into a fully connected layer to obtain the semantic vector of each case fact; and,
acquire the label vectors of the feature labels, input the label vectors and the hierarchical relationship of the feature labels into a bidirectional tree-structured long short-term memory network, and input the output result into a fully connected layer to obtain the semantic vector of each label;
pass the semantic vector of each case fact and the semantic vector of each label through an attention network to obtain the weight of each case fact with respect to the different labels;
obtain the labeling result of the case facts according to the weights of the different labels corresponding to each case fact; the labeling result of the case facts comprises the probability of each label corresponding to each case fact and the hierarchical relationship of the labels.
Optionally, the preset model comprises: a case fact encoder, a label structure encoder, and an attention network; the case fact encoder comprises a bidirectional long short-term memory network and a convolutional neural network connected to each other; the label structure encoder comprises a bidirectional tree-structured long short-term memory network.
Optionally, performing word segmentation processing on the case fact text to be labeled comprises: performing word segmentation processing on the case fact text to be labeled with the LTP tool or the jieba word segmentation tool.
Optionally, the model construction module is specifically configured to:
input the classification probabilities into a confidence calculation formula to obtain the corresponding confidence; the confidence calculation formula is:

C(x) = (1/n) Σ_{j=1..n} max( P(y_j | x), 1 − P(y_j | x) )

where y_j is the j-th label, P(y_j | x) is the probability, predicted by the model, that sample x has label y_j, and n is the number of labels.
The specific manner in which the modules perform their operations in the device of the above embodiment has been described in detail in the method embodiments and is not elaborated here.
It can be understood that, with the technical scheme provided by this embodiment, when the case feature labeling model is constructed, an active learning strategy is used to select the partial samples with the greatest labeling benefit and submit them to legal experts for labeling, a semi-supervised learning strategy is used to select the partial samples with the highest confidence to expand the training set, and multi-level, multi-label case feature labeling is performed after multiple iterations; the method thereby combines the advantages of active learning and semi-supervised learning, obtains a larger labeling benefit and more high-quality training samples while labeling less data, models the hierarchical structure and semantic relations of case labels, solves the technical problems of the excessively high cost of fully manual labeling and the long-tail effect in current case feature labeling, and improves labeling efficiency and accuracy.
Based on the same general inventive concept, an embodiment of the invention provides a case feature labeling apparatus based on active learning and semi-supervised learning, which is used for realizing the above method embodiments.
Fig. 5 is a schematic structural diagram of a case feature labeling apparatus based on active learning and semi-supervised learning according to an embodiment of the present invention. As shown in fig. 5, the apparatus of this embodiment comprises a processor 51 and a memory 52, the processor 51 being connected to the memory 52; the processor 51 is configured to call and execute a program stored in the memory 52, and the memory 52 is configured to store the program, the program being at least used for executing the case feature labeling method based on active learning and semi-supervised learning of the above embodiments.
For specific implementations of the case feature labeling apparatus provided in this application, reference may be made to the implementation of the case feature labeling method based on active learning and semi-supervised learning in any of the above embodiments, which is not repeated here.
It is to be understood that the same or similar parts of the above embodiments may refer to each other; what is not described in detail in one embodiment may refer to the same or similar content in other embodiments.
It should be noted that in the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention; changes, modifications, substitutions, and variations may be made to the above embodiments by those of ordinary skill in the art within the scope of the invention.

Claims (9)

1. A case feature labeling method based on active learning and semi-supervised learning, characterized by comprising the following steps:
acquiring a case fact text to be labeled;
performing word segmentation on the case fact text to be labeled, and taking all obtained segmented words as target case facts;
inputting the target case facts into a case feature labeling model to obtain a case feature labeling result of the target case facts; the case feature labeling result comprises: the hierarchical relationship of the case features;
wherein the case feature labeling model is obtained by pre-training with active learning and semi-supervised learning on partly labeled sample case facts and partly unlabeled sample case facts; the case features of the labeled sample case facts are labeled with their hierarchical relationship;
the method further comprises:
acquiring the case fact text of each judgment document in a judgment document set, performing word segmentation on each case fact text, and taking all obtained segmented words as a sample set;
extracting samples to be labeled from the sample set to construct a set of samples to be labeled, and constructing the remaining unlabeled samples into an unlabeled sample set; in response to an expert's labeling instruction, labeling the case features of each sample to be labeled in the set of samples to be labeled and the hierarchical relationship of the case features, to obtain a labeled sample set; wherein each case feature corresponds to a label;
performing model training based on the labeled sample set and a preset model to obtain a preliminary case feature labeling model;
inputting the unlabeled samples in the unlabeled sample set into the preliminary case feature labeling model to obtain a labeling result of each unlabeled sample, the labeling result comprising: a classification probability for each label;
calculating the confidence of each unlabeled sample according to the classification probabilities;
according to the confidence, extracting the low-confidence unlabeled samples from the unlabeled sample set to reconstruct the set of samples to be labeled, and, in response to an expert's labeling instruction, labeling the case features and the hierarchical relationship of the case features of each sample to be labeled in the set to update the labeled sample set; extracting the high-confidence unlabeled samples from the unlabeled samples, and adding the high-confidence unlabeled samples together with their corresponding labeling results to the labeled sample set as labeled samples; and iteratively updating the preliminary case feature labeling model until its number of iterations reaches an iteration threshold, to obtain the case feature labeling model.
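By way of illustration only, the iterative procedure of claim 1 can be condensed into the following Python sketch; every name in it (expert_label, train_model, predict_probs, confidence, and the size parameters) is a hypothetical stand-in injected by the caller, not an identifier from the patent:

```python
# Illustrative sketch of the claim-1 loop: active learning (the expert labels
# the low-confidence samples) combined with semi-supervised learning (the
# high-confidence model predictions are kept as pseudo-labels).
def active_semi_supervised_training(samples, expert_label, train_model,
                                    predict_probs, confidence,
                                    n_seed, k_low, k_high, max_iters):
    labeled = expert_label(samples[:n_seed])      # expert labels the seed set
    unlabeled = list(samples[n_seed:])            # remaining unlabeled pool
    model = train_model(labeled)                  # preliminary labeling model
    for _ in range(max_iters):                    # iterate to the threshold
        probs = {x: predict_probs(model, x) for x in unlabeled}
        ranked = sorted(unlabeled, key=lambda x: confidence(probs[x]),
                        reverse=True)             # confidence, high to low
        labeled += expert_label(ranked[-k_low:])  # low confidence -> expert
        # high confidence -> pseudo-labels; assumes expert_label returns
        # (sample, labels) pairs in the same format
        labeled += [(x, probs[x]) for x in ranked[:k_high]]
        unlabeled = ranked[k_high:-k_low]         # shrink the pool
        model = train_model(labeled)              # iterative update
    return model
```

Injecting the helpers as callables keeps the sketch runnable without committing to any particular model or annotation interface.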
2. The method according to claim 1, wherein extracting the low-confidence unlabeled samples from the unlabeled sample set to reconstruct the set of samples to be labeled comprises:
sorting the unlabeled samples in order of confidence from high to low;
extracting the unlabeled samples ranked last, up to a first threshold number, and reconstructing the set of samples to be labeled.
3. The method according to claim 1, wherein extracting the high-confidence unlabeled samples from the unlabeled samples and adding them together with their corresponding labeling results to the labeled sample set as labeled samples comprises:
sorting the unlabeled samples in order of confidence from high to low;
extracting the unlabeled samples ranked first, up to a second threshold number, together with their corresponding labeling results, as labeled samples, and adding them to the labeled sample set.
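As a minimal sketch under the same assumptions, claims 2 and 3 can share a single descending sort; conf_of, first_threshold, and second_threshold are illustrative names for the confidence function and the two claimed threshold numbers:

```python
# One sort serves both claims: the top-ranked (high-confidence) samples become
# pseudo-labeled training data (claim 3), while the bottom-ranked
# (low-confidence) samples are sent back to the expert (claim 2).
def split_by_confidence(pool, conf_of, first_threshold, second_threshold):
    ranked = sorted(pool, key=conf_of, reverse=True)    # high to low
    pseudo_labeled = ranked[:second_threshold]          # claim 3 extraction
    to_relabel = ranked[len(ranked) - first_threshold:] # claim 2 extraction
    remaining = ranked[second_threshold:len(ranked) - first_threshold]
    return pseudo_labeled, to_relabel, remaining
```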
4. The method of claim 1, wherein each labeled sample comprises: a case fact sample and a corresponding feature label;
and wherein performing model training based on the labeled sample set and a preset model comprises:
acquiring word vectors of the case fact sample, inputting the word vectors into a bidirectional long short-term memory network to obtain the association relationships of the case fact sample, inputting the association relationships into a preset convolutional neural network to obtain various features, and inputting the various features into a fully connected layer to obtain a semantic vector of each case fact; and,
acquiring a label vector of the feature label, inputting the label vector and the hierarchical relationship of the feature labels into a bidirectional tree-structured long short-term memory network, and inputting the output result into a fully connected layer to obtain a semantic vector of each label;
passing the semantic vector of each case fact and the semantic vector of each label through an attention network to obtain the weight of each case fact with respect to the different labels;
obtaining the labeling result of the case facts according to the weights of the different labels corresponding to each case fact; the labeling result of the case facts comprises the probability of each label corresponding to each case fact and the hierarchical relationship of the labels.
5. The method of claim 4, wherein the preset model comprises: a case fact encoder, a label structure encoder, and an attention network; the case fact encoder comprises a bidirectional long short-term memory network and a convolutional neural network connected to each other; the label structure encoder comprises a bidirectional tree-structured long short-term memory network.
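For concreteness, the preset model of claims 4 and 5 might be sketched in PyTorch as below (an assumed framework; the patent does not prescribe one). The bidirectional tree-structured LSTM label encoder is simplified here to an embedding plus a linear layer, so this is a structural sketch of the fact encoder, label encoder, and attention network rather than the claimed implementation:

```python
# Structural sketch: BiLSTM + CNN fact encoder, simplified label encoder,
# and a fact-label attention network producing per-label probabilities.
import torch
import torch.nn as nn

class FactLabelAttentionModel(nn.Module):
    def __init__(self, vocab_size, n_labels, emb=128, hid=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.bilstm = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(2 * hid, hid, kernel_size=3, padding=1)
        self.fact_fc = nn.Linear(hid, hid)            # semantic vector of each case fact
        self.label_emb = nn.Embedding(n_labels, hid)  # stand-in for the Tree-LSTM label encoder
        self.label_fc = nn.Linear(hid, hid)           # semantic vector of each label

    def forward(self, token_ids):                     # token_ids: (batch, seq)
        h, _ = self.bilstm(self.word_emb(token_ids))  # contextual associations
        feats = self.conv(h.transpose(1, 2)).transpose(1, 2)  # local n-gram features
        facts = self.fact_fc(feats)                   # (batch, seq, hid)
        labels = self.label_fc(self.label_emb.weight) # (n_labels, hid)
        attn = torch.softmax(facts @ labels.T, dim=1) # weight of each fact per label
        pooled = torch.einsum('bsl,bsh->blh', attn, facts)  # label-wise fact summary
        logits = (pooled * labels).sum(-1)            # (batch, n_labels)
        return torch.sigmoid(logits)                  # per-label probabilities
```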
6. The method according to claim 1, wherein the word segmentation of the case fact text to be labeled comprises: performing word segmentation on the case fact text to be labeled with the LTP tool or the jieba word segmentation tool.
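A minimal segmentation example using the jieba tokenizer (the LTP toolkit offers a comparable pipeline); the sample sentence is illustrative only:

```python
# Claim-6 preprocessing: segment a case fact text into words that serve as
# the target case facts. jieba.lcut returns the segmented words as a list.
import jieba

text = "被告人在超市内盗窃财物后逃离现场"  # illustrative case fact text
tokens = jieba.lcut(text)
print(tokens)
```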
7. The method of claim 1, wherein calculating the confidence of each unlabeled sample according to the classification probabilities comprises:
inputting the classification probabilities into a confidence calculation formula to obtain the corresponding confidence, the confidence calculation formula being:
[confidence calculation formula, published as an equation image in the original document]
wherein y_j is the j-th label, P(y_j|x) is the probability predicted by the model that sample x carries label y_j, and N is the number of labels.
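Because the formula is published only as an equation image, the sketch below substitutes one common multi-label confidence that is consistent with the stated variables, namely the mean distance of each label probability from the 0.5 decision boundary; this exact form is an assumption, not the patent's formula:

```python
# Assumed confidence: average |P(y_j|x) - 0.5| over the N labels. The score
# peaks at 0.5 when the model is fully decisive on every label and falls to 0
# when every label probability sits on the decision boundary.
def confidence(probs):
    """probs: the list of classification probabilities P(y_j|x), j = 1..N."""
    return sum(abs(p - 0.5) for p in probs) / len(probs)

print(confidence([0.95, 0.02, 0.88]))  # decisive predictions -> approx. 0.44
```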
8. A case feature labeling device based on active learning and semi-supervised learning, characterized by comprising:
an acquisition module, configured to acquire a case fact text to be labeled;
a word segmentation module, configured to perform word segmentation on the case fact text to be labeled and take all obtained segmented words as target case facts;
an input labeling module, configured to input the target case facts into a case feature labeling model to obtain a case feature labeling result of the target case facts; the case feature labeling result comprises: the hierarchical relationship of the case features;
wherein the case feature labeling model is obtained by pre-training with active learning and semi-supervised learning on partly labeled sample case facts and partly unlabeled sample case facts; the case features of the labeled sample case facts are labeled with their hierarchical relationship;
and a model construction module, configured to: acquire the case fact text of each judgment document in a judgment document set, perform word segmentation on each case fact text, and take all obtained segmented words as a sample set; extract samples to be labeled from the sample set to construct a set of samples to be labeled, and construct the remaining unlabeled samples into an unlabeled sample set; in response to an expert's labeling instruction, label the case features of each sample to be labeled in the set of samples to be labeled and the hierarchical relationship of the case features, to obtain a labeled sample set, wherein each case feature corresponds to a label; perform model training based on the labeled sample set and a preset model to obtain a preliminary case feature labeling model; input the unlabeled samples in the unlabeled sample set into the preliminary case feature labeling model to obtain a labeling result of each unlabeled sample, the labeling result comprising a classification probability for each label; calculate the confidence of each unlabeled sample according to the classification probabilities; according to the confidence, extract the low-confidence unlabeled samples from the unlabeled sample set to reconstruct the set of samples to be labeled, and, in response to an expert's labeling instruction, label the case features and the hierarchical relationship of the case features of each sample to be labeled in the set to update the labeled sample set; extract the high-confidence unlabeled samples from the unlabeled samples and add them together with their corresponding labeling results to the labeled sample set as labeled samples; and iteratively update the preliminary case feature labeling model until its number of iterations reaches an iteration threshold, to obtain the case feature labeling model.
9. A case feature labeling device based on active learning and semi-supervised learning, characterized by comprising a processor and a memory, the processor being connected to the memory, wherein:
the processor is configured to call and execute a program stored in the memory; and
the memory is configured to store the program, the program being at least used to execute the case feature labeling method based on active learning and semi-supervised learning according to any one of claims 1-7.
CN202310218333.0A 2023-03-09 2023-03-09 Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning Active CN115952290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310218333.0A CN115952290B (en) 2023-03-09 2023-03-09 Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning

Publications (2)

Publication Number Publication Date
CN115952290A (en) 2023-04-11
CN115952290B (en) 2023-06-02

Family

ID=85903272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310218333.0A Active CN115952290B (en) 2023-03-09 2023-03-09 Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning

Country Status (1)

Country Link
CN (1) CN115952290B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019200806A1 (en) * 2018-04-20 2019-10-24 平安科技(深圳)有限公司 Device for generating text classification model, method, and computer readable storage medium
WO2021164226A1 (en) * 2020-02-20 2021-08-26 平安科技(深圳)有限公司 Method and apparatus for querying knowledge map of legal cases, device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334500B (en) * 2018-03-05 2022-02-22 上海思贤信息技术股份有限公司 Referee document labeling method and device based on machine learning algorithm
CN111723209B (en) * 2020-06-28 2023-04-25 上海携旅信息技术有限公司 Semi-supervised text classification model training method, text classification method, system, equipment and medium
CN112347268B (en) * 2020-11-06 2024-03-19 华中科技大学 Text-enhanced knowledge-graph combined representation learning method and device
CN113378563B (en) * 2021-02-05 2022-05-17 中国司法大数据研究院有限公司 Case feature extraction method and device based on genetic variation and semi-supervision
CN115599920A (en) * 2022-11-10 2023-01-13 中科蓝智(武汉)科技有限公司(Cn) Text classification method based on active semi-supervised learning and heterogeneous graph attention network

Also Published As

Publication number Publication date
CN115952290A (en) 2023-04-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant