CN113220850B - Case image mining method for court trial and reading - Google Patents

Case image mining method for court trial and reading

Info

Publication number
CN113220850B
Authority
CN
China
Prior art keywords
case
information
training
personal information
paragraphs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110451235.2A
Other languages
Chinese (zh)
Other versions
CN113220850A (en
Inventor
张可
杨晨
殷敏
费志伟
顾平莉
李常宝
刘忠麟
艾中良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110451235.2A priority Critical patent/CN113220850B/en
Publication of CN113220850A publication Critical patent/CN113220850A/en
Application granted granted Critical
Publication of CN113220850B publication Critical patent/CN113220850B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a case image mining method for court trial file review, which extracts case image information by combining rule-based techniques with natural language processing methods based on statistical learning. First, a large number of real cases are acquired and the case files are preprocessed; then an empty case image tree, i.e. the information to be extracted, is constructed according to relevant legal knowledge such as the constitution and the criminal law; next, the information required by the image tree is extracted with the defined rules and trained models, which need only a small amount of labeled data for training; finally, a complete case image tree is generated from the extracted information. The method enables automated file review by computer, and the extracted case information allows judges to quickly grasp the case and supports subsequent automated adjudication.

Description

Case image mining method for court trial and reading
Technical Field
The invention belongs to the field of Information Extraction (IE) in natural language processing technology, and particularly relates to a case image mining method for court trial and reading.
Background
With the development of computing hardware and of the internet in recent years, deep learning has made great progress in fields such as images and text, achieving notable success in object detection, image segmentation, image classification, text translation and the like; its application in the judicial field, however, and particularly in Chinese judicial practice, is still at an early research stage. At present, judicial personnel must read the case file to understand the facts during a trial, but a case file comprises a large number of documents, including interrogation records, search records, witness statements and other materials. Reading and organizing these documents requires considerable labor and reduces trial efficiency, so automatically extracting the information useful for adjudication with computer technology and displaying it in a structured form can greatly relieve judicial personnel of tedious file reading.
The construction of case portraits belongs to the field of information extraction. Information Extraction (IE) refers to text processing techniques that automatically extract factual information of specified types, such as entities, relations and events, from natural language text and output it as structured data. Information extraction is related to the problem of text simplification; the general purpose is to create text that a machine can process more readily. Information extraction typically comprises the following sub-tasks. The first step in most IE tasks is to find the proper names, or named entities, mentioned in the text; the task of Named Entity Recognition (NER) is to locate each named entity in the text and tag its type. The task of relation extraction is to find and classify semantic relationships between text entities, typically binary relations such as spouse, child, employment, affiliation and geospatial location; relation extraction is closely connected to populating relational databases. The task of event extraction is to find the events in which these entities participate, for example the event of two airlines in the United States raising fares and the reporting events that describe and refer to it; it must also be determined, by looking for commonalities, which of the many event mentions in a text refer to the same event. In addition, rule-based natural language processing techniques are widely used in the field of information extraction.
At present, in the field of intelligent judicial services, much work has been devoted to case element extraction. Traditional case element extraction mostly relies on information retrieval with keyword matching at its core, which suffers from insufficient standardization, accuracy and retrieval efficiency, and can hardly satisfy the demands of an intelligent adjudication assistance system for high intelligence, sentencing precision and functional diversity. To realize standardized, accurate and efficient case element extraction, several main problems still need to be addressed: 1. the lack of a professional, unified and standard case element knowledge system; 2. the low accuracy and poor extensibility of rule-based case element extraction; and 3. the lack of large amounts of labeled data for statistical learning.
Disclosure of Invention
In view of the above, the present invention aims to provide a case image mining method for court trial and reading, which can accurately extract case information.
A case portrait mining method for court trial viewing includes the following steps:
Step 1, acquiring court trial file data, cleaning the file data to remove dirty data, and classifying the data according to the different charges (crime names) to form an original data set;
step 2, defining a case image tree model;
step 3, extracting information, which specifically comprises the following steps:
Step 31, training the BERT language model by adopting the file data obtained in the step 1, and dividing the text paragraph into three categories of personal information, case facts and other categories; then, identifying the category of each paragraph by adopting a trained BERT language model;
step 32, extracting personal information of paragraphs classified as personal information; for paragraphs classified as case facts, extracting case fact information in the paragraphs;
And 4, filling the case image tree model of step 2 according to the personal information and the case fact information obtained in step 3 to obtain a case image tree.
Preferably, the specific method in step 31 is as follows:
Selecting documents from the file data and labeling their paragraphs into three classes: personal information, case facts, and others, each class containing a number of paragraph samples that form a training data set; and training the BERT language model with the training data set;
Calculating the output of every paragraph sample in the training data set with the trained BERT language model, and computing the mean of the sample outputs of each class, defined as s1, s2 and s3;
In the prediction stage, the BERT model output of a sample to be predicted is obtained, the Euclidean distances between this output and s1, s2 and s3 are calculated respectively, and the sample is finally assigned to the class with the smallest Euclidean distance.
Preferably, after the BERT language model is trained, it is fine-tuned with metric learning, and the adjusted BERT language model is then used to predict sample outputs.
Preferably, in the step 32, the method for extracting the personal information is as follows:
The question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
The first sentence is split by commas, and the resulting words are mapped to labels in the order: name, gender, year and month of birth, native place, education level, identification card number, occupation, and household registration address.
Preferably, in the step 32, the personal information is extracted using a named entity recognition method.
Preferably, in the step 32, the method for extracting the case fact information of the criminal suspects includes:
selecting the paragraphs classified as case facts in the interrogation records;
traversing all the selected paragraphs and matching the phrase 'crime process' in the question sentences, the successfully matched paragraphs being the descriptive paragraphs of the case facts.
Preferably, the extraction of the victim's case facts is based on the victim's interrogation record, and the extraction method is the same as that for the criminal suspect.
Preferably, the crime history information of the suspect is extracted by the following steps:
For paragraphs classified as personal information, finding the question-answer pairs containing the personal information, and then splitting the suspect's answers by periods;
Traversing the split sentences and matching each against a regular expression that captures mentions of prior offences, judgments or courts, the sentences that match successfully constituting the suspect's criminal history information.
The invention has the following beneficial effects:
The invention extracts case image information with a combination of rule-based methods and statistical-learning-based natural language processing. First, a large number of real cases are acquired and the case files are preprocessed; then the information to be extracted, namely an empty case image tree, is constructed according to relevant legal knowledge such as the constitution and the criminal law; next, the information required by the image tree is extracted with the defined rules and trained models, which need only a small amount of labeled data for training; finally, a complete case image tree is generated from the extracted information. The method enables automated file review by computer, and the extracted case information allows judges to quickly grasp the case and supports subsequent automated adjudication.
Drawings
FIG. 1 is a case portrait tree created by the present invention;
FIG. 2 is a flowchart of the case image mining method for court trial file review according to the present invention.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention is realized by the following steps:
Step 1, acquiring and processing a case file
Acquire case file data from the judgment documents network, courts and other institutions, clean the file data to remove dirty data, and classify it according to the different charges to form an original data set.
Step 2, defining a case image tree
The case portrait tree structures the file data so that the information useful for adjudicating the case is displayed. Therefore, the information to be extracted is defined first, including the basic information of the criminal suspect, the basic information of the victim, the case process, and so on; once this information is defined, an empty case portrait tree is formed.
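To make the structure concrete, the following is a minimal sketch of such an empty case portrait tree as a nested Python dictionary; the top-level branches follow the information categories named above (suspect basic information, victim basic information, case process), while the exact leaf fields and names are illustrative assumptions rather than the patent's schema.

```python
# A minimal sketch of an empty case portrait tree. The leaf fields are
# illustrative assumptions; step 4 later fills the None values and the
# empty list with the information extracted in step 3.
EMPTY_CASE_PORTRAIT = {
    "suspect": {
        "basic_information": {
            "name": None, "gender": None, "date_of_birth": None,
            "education_level": None, "id_card_number": None,
            "occupation": None, "household_registration": None,
        },
        "crime_history": [],   # prior convictions, extracted by rules
    },
    "victim": {
        "basic_information": {"name": None, "gender": None},
        "case_facts": None,    # the victim's account of the case process
    },
    "case_process": None,      # the suspect's account of the case facts
}
```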
Step 3, construction of an information extractor
Case file data typically contains a large number of documents, but the documents used to build case portraits fall largely into three categories: records (including interrogation records, witness statements and the like), legal documents (including indictments and the like), and tables (including resident population information tables, price appraisal conclusions and the like). The content of the case portrait comes mainly from these three categories of documents, and the invention extracts the case information from them using rules and machine learning models.
The information extraction process of the invention has two steps: first the document is segmented and the meaning expressed by each paragraph is roughly determined; then the case portrait information is extracted in depth from the paragraphs by models or rules.
Step 31, segmentation procedure
The documents in a case file follow certain writing conventions. For example, an interrogation record takes a question-and-answer form in which the questions mainly cover personal information and the crime process, and an indictment generally describes the crime process after introducing the suspect's personal information (including any criminal history). A feasible scheme is therefore to determine the main content of each passage first and then extract information with different methods according to that content.
The case information extracted by the method comes mainly from interrogation records, indictments and other documents, and document segmentation is performed with few-shot learning based on metric learning. The invention mainly extracts personal information and case facts, and the descriptions of these two parts differ considerably in semantics, so a pre-trained BERT language model is used to classify the paragraphs and metric learning is used to fine-tune the pre-trained model. The main steps are as follows:
1. Construct a data set: select a small portion of the documents and label their paragraphs into three classes: personal information, case facts, and others; each class contains several paragraph samples, and the training data set can be expressed as D = {(x1, y1), (x2, y2), …, (xn, yn)}, where xi denotes the i-th training sample and yi its label;
2. the invention adopts Euclidean distance to measure the similarity of two samples, and the specific loss function is as follows:
f denotes the BERT pre-training model, xi1 and xi2 are training samples drawn from different classes of the training data set, and the invention maximizes this loss function with a stochastic gradient descent algorithm during training.
3. Compute the output (an n-dimensional vector) of every sample in the training set with the fine-tuned pre-trained model, and compute the mean of the sample outputs of each class, denoted s1, s2 and s3;
4. In the prediction stage, first obtain the BERT model output of the sample to be predicted, then compute the Euclidean distance between this output and each of s1, s2 and s3, and finally assign the sample to the class with the smallest Euclidean distance, as sketched below.
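A minimal sketch of this prototype-style classification step follows; it assumes a `bert_embed` function that returns the fine-tuned BERT embedding of a paragraph, which is not specified in the patent, and the class names are those used in the description.

```python
import numpy as np

CLASSES = ["personal_information", "case_facts", "other"]

def class_prototypes(train_samples, bert_embed):
    """Compute the mean embedding (s1, s2, s3) of each class from labeled paragraphs.

    train_samples: dict mapping class name -> list of paragraph strings.
    bert_embed:    callable returning an n-dimensional vector for a paragraph
                   (assumed to be the fine-tuned BERT model's output).
    """
    return {c: np.mean([bert_embed(p) for p in paragraphs], axis=0)
            for c, paragraphs in train_samples.items()}

def classify_paragraph(paragraph, prototypes, bert_embed):
    """Assign the paragraph to the class whose prototype is nearest in Euclidean distance."""
    v = bert_embed(paragraph)
    return min(prototypes, key=lambda c: np.linalg.norm(v - prototypes[c]))
```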
Step 32, information extraction
The segmented text can be matched against the information required by the case image tree through rules, or the information can be extracted by a machine learning model. For the personal information of the suspect, which is extracted from the interrogation record, the invention provides both a rule-based and a model-based extraction approach. The rule-based approach is as follows:
1. the question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
2. the first sentence is split by commas, and the resulting words are mapped to labels in the order: name, gender, year and month of birth, education level, identification card number, occupation, and household registration address, as sketched below;
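The following sketch illustrates the comma-splitting rule; the field order follows the list above, and the field names are illustrative assumptions rather than the patent's exact labels.

```python
import re

# Field order as described above; the exact order in real interrogation
# records may vary, so this list is an illustrative assumption.
PERSONAL_INFO_FIELDS = [
    "name", "gender", "birth_year_month", "education_level",
    "id_card_number", "occupation", "household_registration",
]

def extract_personal_info(answer_text):
    """Rule-based extraction: take the first sentence of the suspect's answer,
    split it by commas, and map the pieces to the fields in order."""
    first_sentence = re.split(r"[。.]", answer_text)[0]                 # split the answer by periods
    parts = [p.strip() for p in re.split(r"[，,]", first_sentence) if p.strip()]
    return dict(zip(PERSONAL_INFO_FIELDS, parts))
```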
The model-based approach of the invention mainly uses named entity recognition for extraction: a named entity recognition model is trained to extract information from these sentences, with the following main process:
1. Label the data using the BIOE tagging scheme;
2. Train a BiLSTM+CRF model;
3. Predict on the data with the trained model.
Experiments show that the named entity recognition model trained with BiLSTM+CRF recognizes personal information text well and generalizes well; a minimal sketch of such a model is given below.
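This sketch assumes the third-party `pytorch-crf` package for the CRF layer and uses character-level token IDs; the hyperparameters are placeholders, not values from the patent.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package "pytorch-crf" (assumed dependency)

class BiLSTMCRF(nn.Module):
    """Character-level BiLSTM encoder with a CRF decoding layer for BIOE tagging."""
    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, token_ids):
        out, _ = self.lstm(self.embedding(token_ids))
        return self.hidden2tag(out)

    def loss(self, token_ids, tags, mask):
        # pytorch-crf returns the log-likelihood, so negate it to obtain a loss
        return -self.crf(self.emissions(token_ids), tags, mask=mask)

    def decode(self, token_ids, mask):
        # returns the most likely BIOE tag sequence for each input sequence
        return self.crf.decode(self.emissions(token_ids), mask=mask)
```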
The criminal history information of the suspect also comes from the interrogation record, and the invention uses rules to extract this history information, with the following main process:
1. the question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
2. the split sentences are traversed and matched against a regular expression (in the Python language) that captures mentions of prior offences, judgments or courts; the sentences that match successfully constitute the suspect's criminal history information, as sketched below.
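Because the original regular expression is lost in the machine translation, the sketch below uses a hypothetical pattern that looks for mentions of a court or a prior judgment; it illustrates the structure of the rule rather than reproducing the patent's exact pattern.

```python
import re

# Hypothetical pattern: the patent's own regex does not survive translation,
# so this one simply looks for mentions of a court (法院), a judgment
# (判决/判处) or detention (拘留) as a proxy for prior offences.
CRIME_HISTORY_PATTERN = re.compile(r".*(法院|判决|判处|拘留).*")

def extract_crime_history(answer_text):
    """Split the suspect's answer by periods and keep the sentences that match."""
    sentences = [s for s in re.split(r"[。.]", answer_text) if s]
    return [s for s in sentences if CRIME_HISTORY_PATTERN.match(s)]
```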
The invention mainly uses rules to extract the case facts of criminal suspects, with the following extraction process:
1. find the descriptive text of the case facts through the segmentation model;
2. traverse all the selected paragraphs and match the phrase 'crime process' in the question sentences; the successfully matched paragraphs are the descriptive paragraphs of the case facts (a minimal sketch follows).
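The sketch below assumes each paragraph is represented as a (question, answer) pair produced by the segmentation step; the Chinese keyword is an assumed rendering of the phrase "crime process" and may differ from the wording in real records.

```python
def extract_case_fact_paragraphs(case_fact_paragraphs):
    """case_fact_paragraphs: iterable of (question, answer) pairs taken from the
    paragraphs that the segmentation model labeled as case facts.
    Returns the answers whose question mentions the crime process."""
    keyword = "犯罪经过"  # "crime process" (assumed phrasing in the original records)
    return [answer for question, answer in case_fact_paragraphs if keyword in question]
```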
The construction of the victim's case facts is similar to that of the criminal suspect's case facts, except that the victim's case facts are extracted from the victim's interrogation record.
Step 4, generating a case image tree
Associate the extracted case information with the case elements defined on the case image tree, and display the case image tree visually.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. The case portrait mining method for court trial viewing is characterized by comprising the following steps:
Step 1, acquiring court trial file data, cleaning the file data to remove dirty data, and classifying the data according to different criminal names to form an original data set;
step 2, defining a case image tree model;
step 3, extracting information, which specifically comprises the following steps:
Step 31, training the BERT language model by adopting the file data obtained in the step 1, and dividing the text paragraph into three categories of personal information, case facts and other categories; then, identifying the category of each paragraph by adopting a trained BERT language model;
Step 32, extracting personal information in paragraphs classified as personal information, and extracting crime history information of a suspected person; for paragraphs classified as case facts, extracting case fact information in the paragraphs;
step 4, according to the personal information and the case fact information obtained in the step 3, associating the extracted case information with case elements defined on the case image tree, and filling the case image tree model in the step 2 to obtain the case image tree;
The specific method in step 31 is as follows:
Selecting documents from the file data and labeling their paragraphs into three classes: personal information, case facts, and others, each class containing a number of paragraph samples that form a training data set; and training the BERT language model with the training data set;
Calculating the output of every paragraph sample in the training data set with the trained BERT language model, and computing the mean of the sample outputs of each class, defined as s1, s2 and s3;
In the prediction stage, first obtaining the BERT model output of a sample to be predicted, then calculating the Euclidean distances between this output and s1, s2 and s3 respectively, and finally assigning the sample to the class with the smallest Euclidean distance; after the BERT language model is trained, it is fine-tuned with metric learning, and the adjusted BERT language model is then used to predict sample outputs; wherein the method of fine-tuning the pre-trained BERT language model with metric learning is as follows:
the similarity of two samples is measured by the Euclidean distance, and the specific loss function is as follows:
f denotes the BERT pre-training model, xi1 and xi2 are training samples from different classes of the training data set, and a stochastic gradient descent algorithm is used to maximize the loss function during training;
In the step 32, the method for extracting the case fact information of the criminal suspects is as follows:
selecting the paragraphs classified as case facts in the interrogation records;
traversing all the selected paragraphs and matching the phrase 'crime process' in the question sentences, the successfully matched paragraphs being the descriptive paragraphs of the case facts;
the extraction of the victim's case facts is based on the victim's interrogation record, and the extraction method is the same as that for the criminal suspect;
the crime history information extraction method of the suspects comprises the following steps:
For paragraphs classified as personal information, finding the question-answer pairs containing the personal information, and then splitting the suspect's answers by periods;
Traversing the split sentences and matching each against a regular expression that captures mentions of prior offences, judgments or courts, the sentences that match successfully constituting the suspect's criminal history information.
2. The case image mining method for court trial viewing according to claim 1, wherein in the step 32, the method for extracting the personal information is as follows:
The question-answer pairs containing the personal information are found through the segmentation model, and the suspect's answers are then split by periods;
The first sentence is split by commas, and the resulting words are mapped to labels in the order: name, gender, year and month of birth, native place, education level, identification card number, occupation, and household registration address.
3. The case representation mining method for court trial viewing according to claim 1, wherein in the step 32, personal information is extracted using a named entity recognition method.
CN202110451235.2A 2021-04-26 2021-04-26 Case image mining method for court trial and reading Active CN113220850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110451235.2A CN113220850B (en) 2021-04-26 2021-04-26 Case image mining method for court trial and reading

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110451235.2A CN113220850B (en) 2021-04-26 2021-04-26 Case image mining method for court trial and reading

Publications (2)

Publication Number Publication Date
CN113220850A CN113220850A (en) 2021-08-06
CN113220850B true CN113220850B (en) 2024-06-11

Family

ID=77089063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110451235.2A Active CN113220850B (en) 2021-04-26 2021-04-26 Case image mining method for court trial and reading

Country Status (1)

Country Link
CN (1) CN113220850B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710712A (en) * 2018-12-17 2019-05-03 中国人民公安大学 A kind of crime hot spot feature method for digging and system based on case factor analysis
CN110059193A (en) * 2019-06-21 2019-07-26 南京擎盾信息科技有限公司 Legal advice system based on law semanteme part and document big data statistical analysis
CN111145052A (en) * 2019-12-26 2020-05-12 北京法意科技有限公司 Structured analysis method and system of judicial documents
CN111680125A (en) * 2020-06-05 2020-09-18 深圳市华云中盛科技股份有限公司 Litigation case analysis method, litigation case analysis device, computer device, and storage medium
CN111984687A (en) * 2020-07-20 2020-11-24 武汉市润普网络科技有限公司 Executing case drawing system
CN112417880A (en) * 2020-11-30 2021-02-26 太极计算机股份有限公司 Court electronic file oriented case information automatic extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
自然语言处理技术在司法过程中的应用研究 (Research on the application of natural language processing technology in the judicial process); 张德 (Zhang De); 信息与电脑(理论版) (Information & Computer (Theory Edition)), No. 17; full text *

Also Published As

Publication number Publication date
CN113220850A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
Obaidullah et al. PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification
Neculoiu et al. Learning text similarity with siamese recurrent networks
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
CN111597304A (en) Secondary matching method for accurately identifying Chinese enterprise name entity
CN112632989B (en) Method, device and equipment for prompting risk information in contract text
CN107315738A (en) A kind of innovation degree appraisal procedure of text message
CN113505586A (en) Seat-assisted question-answering method and system integrating semantic classification and knowledge graph
CN111159356B (en) Knowledge graph construction method based on teaching content
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN113946677A (en) Event identification and classification method based on bidirectional cyclic neural network and attention mechanism
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN110704615B (en) Internet financial non-dominant advertisement identification method and device
KR102185733B1 (en) Server and method for automatically generating profile
CN112989811B (en) History book reading auxiliary system based on BiLSTM-CRF and control method thereof
CN111597330A (en) Intelligent expert recommendation-oriented user image drawing method based on support vector machine
CN111459973B (en) Case type retrieval method and system based on case situation triple information
CN113553419A (en) Civil aviation knowledge map question-answering system
CN116629258A (en) Structured analysis method and system for judicial document based on complex information item data
CN113220850B (en) Case image mining method for court trial and reading
Pedersen et al. Lessons learned developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes
CN111078874A (en) Foreign Chinese difficulty assessment method based on decision tree classification of random subspace
Silva Parts that add up to a whole: a framework for the analysis of tables
CN116383331A (en) Method and system for constructing Chinese event library and analyzing and predicting meta event based on meta event library
CN111798217B (en) Data analysis system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant