CN113220850B - Case image mining method for court trial and reading - Google Patents

Case image mining method for court trial and reading

Info

Publication number
CN113220850B
Authority
CN
China
Prior art keywords
case
information
training
personal information
paragraphs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110451235.2A
Other languages
Chinese (zh)
Other versions
CN113220850A (en
Inventor
张可
杨晨
殷敏
费志伟
顾平莉
李常宝
刘忠麟
艾中良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110451235.2A priority Critical patent/CN113220850B/en
Publication of CN113220850A publication Critical patent/CN113220850A/en
Application granted granted Critical
Publication of CN113220850B publication Critical patent/CN113220850B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a case image mining method for court trial file review, which extracts case image information by combining rule-based techniques with natural language processing methods based on statistical learning. First, a large number of real cases are acquired and the case files are preprocessed; then an empty case image tree, i.e. the information to be extracted, is constructed according to relevant legal knowledge such as the constitution and the criminal law; next, the information required by the image tree is extracted with the defined rules and trained models, which need only a small amount of labeled data for training; finally, a complete case image tree is generated from the extracted information. The method enables automated file review by computer, and the extracted case information allows judges to quickly grasp the case and supports subsequent automated adjudication.

Description

Case image mining method for court trial and reading
Technical Field
The invention belongs to the field of Information Extraction (IE) in natural language processing technology, and particularly relates to a case image mining method for court trial and reading.
Background
With the development of computing hardware and of the internet in recent years, deep learning has made great progress in fields such as images and text, achieving notable success in object detection, image segmentation, image classification, text translation and the like; its application in the judicial field, however, and particularly in Chinese judicial practice, is still at an early research stage. At present, judicial personnel must read the case file to understand the facts during a trial, but a case file comprises a large number of documents, including interrogation records, search records, witness statements and other materials. Reading and organizing these documents requires considerable labor and reduces trial efficiency, so automatically extracting the information useful for adjudication with computer technology and displaying it in a structured form can greatly relieve judicial personnel of tedious file reading.
The construction of case portraits belongs to the field of information extraction. Information Extraction (IE) refers to text processing techniques that automatically extract factual information of specified types, such as entities, relations and events, from natural language text and output it as structured data. Information extraction is related to the problem of text simplification; the general purpose is to create text that a machine can process more readily. Information extraction typically comprises the following sub-tasks. The first step in most IE tasks is to find the proper names, or named entities, mentioned in the text; the task of Named Entity Recognition (NER) is to locate each named entity in the text and tag its type. The task of relation extraction is to find and classify semantic relationships between text entities, typically binary relations such as spouse, child, employment, affiliation and geospatial location; relation extraction is closely connected to populating relational databases. The task of event extraction is to find the events in which these entities participate, for example the event of two airlines in the United States raising fares and the reporting events that describe and refer to it; it must also be determined, by looking for commonalities, which of the many event mentions in a text refer to the same event. In addition, rule-based natural language processing techniques are widely used in the field of information extraction.
At present, in the field of intelligent judicial services, much work has been devoted to case element extraction. Traditional case element extraction mostly relies on information retrieval with keyword matching at its core, which suffers from insufficient standardization, accuracy and retrieval efficiency, and can hardly satisfy the demands of an intelligent adjudication assistance system for high intelligence, sentencing precision and functional diversity. To realize standardized, accurate and efficient case element extraction, several main problems still need to be addressed: 1. the lack of a professional, unified and standard case element knowledge system; 2. the low accuracy and poor extensibility of rule-based case element extraction; and 3. the lack of large amounts of labeled data for statistical learning.
Disclosure of Invention
In view of the above, the present invention aims to provide a case image mining method for court trial and reading, which can accurately extract case information.
A case portrait mining method for court trial viewing includes the following steps:
Step 1, acquiring court trial file data, cleaning the file data to remove dirty data, and classifying the data according to the different charges (crime names) to form an original data set;
step 2, defining a case image tree model;
step 3, extracting information, which specifically comprises the following steps:
Step 31, training the BERT language model by adopting the file data obtained in the step 1, and dividing the text paragraph into three categories of personal information, case facts and other categories; then, identifying the category of each paragraph by adopting a trained BERT language model;
step 32, extracting personal information of paragraphs classified as personal information; for paragraphs classified as case facts, extracting case fact information in the paragraphs;
And 4, filling the case image tree model of step 2 according to the personal information and the case fact information obtained in step 3 to obtain a case image tree.
Preferably, the specific method in step 31 is as follows:
Selecting documents from the file data and labeling their paragraphs into three classes: personal information, case facts, and others, each class containing a number of paragraph samples that form a training data set; and training the BERT language model with the training data set;
Calculating the output of every paragraph sample in the training data set with the trained BERT language model, and computing the mean of the sample outputs of each class, defined as s1, s2 and s3;
In the prediction stage, the BERT model output of a sample to be predicted is obtained, the Euclidean distances between this output and s1, s2 and s3 are calculated respectively, and the sample is finally assigned to the class with the smallest Euclidean distance.
Preferably, after the BERT language model is trained, it is fine-tuned with metric learning, and the adjusted BERT language model is then used to predict sample outputs.
Preferably, in the step 32, the method for extracting the personal information is as follows:
The question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
The first sentence is split by commas, and the resulting words are mapped to labels in the order: name, gender, year and month of birth, native place, education level, identification card number, occupation, and household registration address.
Preferably, in the step 32, the personal information is extracted using a named entity recognition method.
Preferably, in the step 32, the method for extracting the case fact information of the criminal suspects includes:
selecting the paragraphs classified as case facts in the interrogation records;
traversing all the selected paragraphs and matching the phrase 'crime process' in the question sentences, the successfully matched paragraphs being the descriptive paragraphs of the case facts.
Preferably, the extraction of the victim's case facts is based on the victim's interrogation record, and the extraction method is the same as that for the criminal suspect.
Preferably, the crime history information of the suspect is extracted by the following steps:
For paragraphs classified as personal information, finding the question-answer pairs containing the personal information, and then splitting the suspect's answers by periods;
Traversing the split sentences and matching each against a regular expression that captures mentions of prior offences, judgments or courts, the sentences that match successfully constituting the suspect's criminal history information.
The invention has the following beneficial effects:
The invention extracts case image information with a combination of rule-based methods and statistical-learning-based natural language processing. First, a large number of real cases are acquired and the case files are preprocessed; then the information to be extracted, namely an empty case image tree, is constructed according to relevant legal knowledge such as the constitution and the criminal law; next, the information required by the image tree is extracted with the defined rules and trained models, which need only a small amount of labeled data for training; finally, a complete case image tree is generated from the extracted information. The method enables automated file review by computer, and the extracted case information allows judges to quickly grasp the case and supports subsequent automated adjudication.
Drawings
FIG. 1 is a case portrait tree created by the present invention;
FIG. 2 is a flowchart of the case image mining method for court trial file review according to the present invention.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention is realized by the following steps:
Step 1, acquiring and processing a case file
Acquire case file data from the judgment documents network, courts and other institutions, clean the file data to remove dirty data, and classify it according to the different charges to form an original data set.
Step 2, defining a case image tree
The case portrait tree structures the file data so that the information useful for adjudicating the case is displayed. Therefore, the information to be extracted is defined first, including the basic information of the criminal suspect, the basic information of the victim, the case process, and so on; once this information is defined, an empty case portrait tree is formed.
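To make the structure concrete, the following is a minimal sketch of such an empty case portrait tree as a nested Python dictionary; the top-level branches follow the information categories named above (suspect basic information, victim basic information, case process), while the exact leaf fields and names are illustrative assumptions rather than the patent's schema.

```python
# A minimal sketch of an empty case portrait tree. The leaf fields are
# illustrative assumptions; step 4 later fills the None values and the
# empty list with the information extracted in step 3.
EMPTY_CASE_PORTRAIT = {
    "suspect": {
        "basic_information": {
            "name": None, "gender": None, "date_of_birth": None,
            "education_level": None, "id_card_number": None,
            "occupation": None, "household_registration": None,
        },
        "crime_history": [],   # prior convictions, extracted by rules
    },
    "victim": {
        "basic_information": {"name": None, "gender": None},
        "case_facts": None,    # the victim's account of the case process
    },
    "case_process": None,      # the suspect's account of the case facts
}
```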
Step 3, construction of an information extractor
Case file data typically contains a large number of documents, but the documents used to build case portraits fall largely into three categories: records (including interrogation records, witness statements and the like), legal documents (including indictments and the like), and tables (including resident population information tables, price appraisal conclusions and the like). The content of the case portrait comes mainly from these three categories of documents, and the invention extracts the case information from them using rules and machine learning models.
The information extraction process of the invention has two steps: first the document is segmented and the meaning expressed by each paragraph is roughly determined; then the case portrait information is extracted in depth from the paragraphs by models or rules.
Step 31, segmentation procedure
The documents in a case file follow certain writing conventions. For example, an interrogation record takes a question-and-answer form in which the questions mainly cover personal information and the crime process, and an indictment generally describes the crime process after introducing the suspect's personal information (including any criminal history). A feasible scheme is therefore to determine the main content of each passage first and then extract information with different methods according to that content.
The case information extracted by the method comes mainly from interrogation records, indictments and other documents, and document segmentation is performed with few-shot learning based on metric learning. The invention mainly extracts personal information and case facts, and the descriptions of these two parts differ considerably in semantics, so a pre-trained BERT language model is used to classify the paragraphs and metric learning is used to fine-tune the pre-trained model. The main steps are as follows:
1. Construct a data set: select a small portion of the documents and label their paragraphs into three classes: personal information, case facts, and others; each class contains several paragraph samples, and the training data set can be expressed as D = {(x1, y1), (x2, y2), …, (xn, yn)}, where xi denotes the i-th training sample and yi its label;
2. the invention adopts Euclidean distance to measure the similarity of two samples, and the specific loss function is as follows:
f denotes the BERT pre-training model, xi1 and xi2 are training samples drawn from different classes of the training data set, and the invention maximizes this loss function with a stochastic gradient descent algorithm during training.
3. Compute the output (an n-dimensional vector) of every sample in the training set with the fine-tuned pre-trained model, and compute the mean of the sample outputs of each class, denoted s1, s2 and s3;
4. In the prediction stage, first obtain the BERT model output of the sample to be predicted, then compute the Euclidean distance between this output and each of s1, s2 and s3, and finally assign the sample to the class with the smallest Euclidean distance, as sketched below.
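A minimal sketch of this prototype-style classification step follows; it assumes a `bert_embed` function that returns the fine-tuned BERT embedding of a paragraph, which is not specified in the patent, and the class names are those used in the description.

```python
import numpy as np

CLASSES = ["personal_information", "case_facts", "other"]

def class_prototypes(train_samples, bert_embed):
    """Compute the mean embedding (s1, s2, s3) of each class from labeled paragraphs.

    train_samples: dict mapping class name -> list of paragraph strings.
    bert_embed:    callable returning an n-dimensional vector for a paragraph
                   (assumed to be the fine-tuned BERT model's output).
    """
    return {c: np.mean([bert_embed(p) for p in paragraphs], axis=0)
            for c, paragraphs in train_samples.items()}

def classify_paragraph(paragraph, prototypes, bert_embed):
    """Assign the paragraph to the class whose prototype is nearest in Euclidean distance."""
    v = bert_embed(paragraph)
    return min(prototypes, key=lambda c: np.linalg.norm(v - prototypes[c]))
```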
Step 32, information extraction
The segmented text can be matched against the information required by the case image tree through rules, or the information can be extracted by a machine learning model. For the personal information of the suspect, which is extracted from the interrogation record, the invention provides both a rule-based and a model-based extraction approach. The rule-based approach is as follows:
1. the question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
2. the first sentence is split by commas, and the resulting words are mapped to labels in the order: name, gender, year and month of birth, education level, identification card number, occupation, and household registration address, as sketched below;
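The following sketch illustrates the comma-splitting rule; the field order follows the list above, and the field names are illustrative assumptions rather than the patent's exact labels.

```python
import re

# Field order as described above; the exact order in real interrogation
# records may vary, so this list is an illustrative assumption.
PERSONAL_INFO_FIELDS = [
    "name", "gender", "birth_year_month", "education_level",
    "id_card_number", "occupation", "household_registration",
]

def extract_personal_info(answer_text):
    """Rule-based extraction: take the first sentence of the suspect's answer,
    split it by commas, and map the pieces to the fields in order."""
    first_sentence = re.split(r"[。.]", answer_text)[0]                 # split the answer by periods
    parts = [p.strip() for p in re.split(r"[，,]", first_sentence) if p.strip()]
    return dict(zip(PERSONAL_INFO_FIELDS, parts))
```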
The model-based approach of the invention mainly uses named entity recognition for extraction: a named entity recognition model is trained to extract information from these sentences, with the following main process:
1. Label the data using the BIOE tagging scheme;
2. Train a BiLSTM+CRF model;
3. Predict on the data with the trained model.
Experiments show that the named entity recognition model trained with BiLSTM+CRF recognizes personal information text well and generalizes well; a minimal sketch of such a model is given below.
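This sketch assumes the third-party `pytorch-crf` package for the CRF layer and uses character-level token IDs; the hyperparameters are placeholders, not values from the patent.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package "pytorch-crf" (assumed dependency)

class BiLSTMCRF(nn.Module):
    """Character-level BiLSTM encoder with a CRF decoding layer for BIOE tagging."""
    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, token_ids):
        out, _ = self.lstm(self.embedding(token_ids))
        return self.hidden2tag(out)

    def loss(self, token_ids, tags, mask):
        # pytorch-crf returns the log-likelihood, so negate it to obtain a loss
        return -self.crf(self.emissions(token_ids), tags, mask=mask)

    def decode(self, token_ids, mask):
        # returns the most likely BIOE tag sequence for each input sequence
        return self.crf.decode(self.emissions(token_ids), mask=mask)
```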
The criminal history information of the suspect also comes from the interrogation record, and the invention uses rules to extract this history information, with the following main process:
1. the question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
2. the split sentences are traversed and matched against a regular expression (in the Python language) that captures mentions of prior offences, judgments or courts; the sentences that match successfully constitute the suspect's criminal history information, as sketched below.
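Because the original regular expression is lost in the machine translation, the sketch below uses a hypothetical pattern that looks for mentions of a court or a prior judgment; it illustrates the structure of the rule rather than reproducing the patent's exact pattern.

```python
import re

# Hypothetical pattern: the patent's own regex does not survive translation,
# so this one simply looks for mentions of a court (法院), a judgment
# (判决/判处) or detention (拘留) as a proxy for prior offences.
CRIME_HISTORY_PATTERN = re.compile(r".*(法院|判决|判处|拘留).*")

def extract_crime_history(answer_text):
    """Split the suspect's answer by periods and keep the sentences that match."""
    sentences = [s for s in re.split(r"[。.]", answer_text) if s]
    return [s for s in sentences if CRIME_HISTORY_PATTERN.match(s)]
```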
The invention mainly uses rules to extract the case facts of criminal suspects, with the following extraction process:
1. find the descriptive text of the case facts through the segmentation model;
2. traverse all the selected paragraphs and match the phrase 'crime process' in the question sentences; the successfully matched paragraphs are the descriptive paragraphs of the case facts (a minimal sketch follows).
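The sketch below assumes each paragraph is represented as a (question, answer) pair produced by the segmentation step; the Chinese keyword is an assumed rendering of the phrase "crime process" and may differ from the wording in real records.

```python
def extract_case_fact_paragraphs(case_fact_paragraphs):
    """case_fact_paragraphs: iterable of (question, answer) pairs taken from the
    paragraphs that the segmentation model labeled as case facts.
    Returns the answers whose question mentions the crime process."""
    keyword = "犯罪经过"  # "crime process" (assumed phrasing in the original records)
    return [answer for question, answer in case_fact_paragraphs if keyword in question]
```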
The construction of the victim's case facts is similar to that of the criminal suspect's case facts, except that the victim's case facts are extracted from the victim's interrogation record.
Step 4, generating a case image tree
Associate the extracted case information with the case elements defined on the case image tree, and display the case image tree visually.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. The case portrait mining method for court trial viewing is characterized by comprising the following steps:
Step 1, acquiring court trial file data, cleaning the file data to remove dirty data, and classifying the data according to different criminal names to form an original data set;
step 2, defining a case image tree model;
step 3, extracting information, which specifically comprises the following steps:
Step 31, training the BERT language model by adopting the file data obtained in the step 1, and dividing the text paragraph into three categories of personal information, case facts and other categories; then, identifying the category of each paragraph by adopting a trained BERT language model;
Step 32, extracting personal information in paragraphs classified as personal information, and extracting crime history information of a suspected person; for paragraphs classified as case facts, extracting case fact information in the paragraphs;
step 4, according to the personal information and the case fact information obtained in the step 3, associating the extracted case information with case elements defined on the case image tree, and filling the case image tree model in the step 2 to obtain the case image tree;
The specific method in step 31 is as follows:
Selecting documents from the file data and labeling their paragraphs into three classes: personal information, case facts, and others, each class containing a number of paragraph samples that form a training data set; and training the BERT language model with the training data set;
Calculating the output of every paragraph sample in the training data set with the trained BERT language model, and computing the mean of the sample outputs of each class, defined as s1, s2 and s3;
In the prediction stage, first obtaining the BERT model output of a sample to be predicted, then calculating the Euclidean distances between this output and s1, s2 and s3 respectively, and finally assigning the sample to the class with the smallest Euclidean distance; after the BERT language model is trained, it is fine-tuned with metric learning, and the adjusted BERT language model is then used to predict sample outputs; wherein the method of fine-tuning the pre-trained BERT language model with metric learning is as follows:
the similarity of two samples is measured by the Euclidean distance, and the specific loss function is as follows:
f denotes the BERT pre-training model, xi1 and xi2 are training samples from different classes of the training data set, and a stochastic gradient descent algorithm is used to maximize the loss function during training;
In the step 32, the method for extracting the case fact information of the criminal suspects is as follows:
selecting the paragraphs classified as case facts in the interrogation records;
traversing all the selected paragraphs and matching the phrase 'crime process' in the question sentences, the successfully matched paragraphs being the descriptive paragraphs of the case facts;
the extraction of the victim's case facts is based on the victim's interrogation record, and the extraction method is the same as that for the criminal suspect;
the crime history information extraction method of the suspects comprises the following steps:
For paragraphs classified as personal information, finding the question-answer pairs containing the personal information, and then splitting the suspect's answers by periods;
Traversing the split sentences and matching each against a regular expression that captures mentions of prior offences, judgments or courts, the sentences that match successfully constituting the suspect's criminal history information.
2. The case image mining method for court trial viewing according to claim 1, wherein in the step 32, the method for extracting the personal information is as follows:
The question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
The first sentence is split by commas, and the resulting words are mapped to labels in the order: name, gender, year and month of birth, native place, education level, identification card number, occupation, and household registration address.
3. The case representation mining method for court trial viewing according to claim 1, wherein in the step 32, personal information is extracted using a named entity recognition method.
CN202110451235.2A 2021-04-26 2021-04-26 Case image mining method for court trial and reading Active CN113220850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110451235.2A CN113220850B (en) 2021-04-26 2021-04-26 Case image mining method for court trial and reading

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110451235.2A CN113220850B (en) 2021-04-26 2021-04-26 Case image mining method for court trial and reading

Publications (2)

Publication Number Publication Date
CN113220850A CN113220850A (en) 2021-08-06
CN113220850B true CN113220850B (en) 2024-06-11

Family

ID=77089063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110451235.2A Active CN113220850B (en) 2021-04-26 2021-04-26 Case image mining method for court trial and reading

Country Status (1)

Country Link
CN (1) CN113220850B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710712A (en) * 2018-12-17 2019-05-03 中国人民公安大学 A kind of crime hot spot feature method for digging and system based on case factor analysis
CN110059193A (en) * 2019-06-21 2019-07-26 南京擎盾信息科技有限公司 Legal advice system based on law semanteme part and document big data statistical analysis
CN111145052A (en) * 2019-12-26 2020-05-12 北京法意科技有限公司 Structured analysis method and system of judicial documents
CN111680125A (en) * 2020-06-05 2020-09-18 深圳市华云中盛科技股份有限公司 Litigation case analysis method, litigation case analysis device, computer device, and storage medium
CN111984687A (en) * 2020-07-20 2020-11-24 武汉市润普网络科技有限公司 Executing case drawing system
CN112417880A (en) * 2020-11-30 2021-02-26 太极计算机股份有限公司 Court electronic file oriented case information automatic extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
自然语言处理技术在司法过程中的应用研究 (Research on the application of natural language processing technology in the judicial process); 张德 (Zhang De); 信息与电脑(理论版) (Information & Computer (Theory Edition)), No. 17; full text *

Also Published As

Publication number Publication date
CN113220850A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
Obaidullah et al. PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification
Neculoiu et al. Learning text similarity with siamese recurrent networks
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
CN111597304A (en) Secondary matching method for accurately identifying Chinese enterprise name entity
CN112632989B (en) Method, device and equipment for prompting risk information in contract text
CN107315738A (en) A kind of innovation degree appraisal procedure of text message
CN113505586A (en) Seat-assisted question-answering method and system integrating semantic classification and knowledge graph
CN111159356B (en) Knowledge graph construction method based on teaching content
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN113946677A (en) Event identification and classification method based on bidirectional cyclic neural network and attention mechanism
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN110704615B (en) Internet financial non-dominant advertisement identification method and device
KR102185733B1 (en) Server and method for automatically generating profile
CN112989811B (en) History book reading auxiliary system based on BiLSTM-CRF and control method thereof
CN111597330A (en) Intelligent expert recommendation-oriented user image drawing method based on support vector machine
CN111459973B (en) Case type retrieval method and system based on case situation triple information
CN113553419A (en) Civil aviation knowledge map question-answering system
CN116629258A (en) Structured analysis method and system for judicial document based on complex information item data
CN113220850B (en) Case image mining method for court trial and reading
Pedersen et al. Lessons learned developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes
CN111078874A (en) Foreign Chinese difficulty assessment method based on decision tree classification of random subspace
Silva Parts that add up to a whole: a framework for the analysis of tables
CN116383331A (en) Method and system for constructing Chinese event library and analyzing and predicting meta event based on meta event library
CN111798217B (en) Data analysis system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant