CN114547315A - Case classification prediction method and device, computer equipment and storage medium - Google Patents

Case classification prediction method and device, computer equipment and storage medium

Info

Publication number
CN114547315A
Authority
CN
China
Prior art keywords
keyword
case
classified
case data
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210437363.6A
Other languages
Chinese (zh)
Inventor
陈晓红
付震坤
胡东滨
梁伟
徐雪松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology
Priority to CN202210437363.6A
Publication of CN114547315A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a case classification prediction method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring case data to be classified; extracting keywords from the case data to be classified, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to that keyword; performing data enhancement on the sentences in the case data to be classified to obtain at least one enhanced sentence; performing text vectorization on all the enhanced sentences to obtain a sentence vector corresponding to each enhanced sentence; splicing all the keyword vectors and all the sentence vectors in a preset order to obtain a spliced vector; and performing context information extraction and prediction classification on the spliced vector to obtain a classification result corresponding to the case data to be classified.

Description

Case classification prediction method and device, computer equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence and judicial science, in particular to a case classification prediction method, a case classification prediction device, computer equipment and a storage medium.
Background
At present, trying and adjudicating a case in court requires not only the professional legal knowledge and skills of legal personnel but also a great deal of heavy, repetitive work, such as reading and understanding case documents and searching for relevant statutes and similar cases. Even for professionals, this consumes a large amount of the adjudicators' time and effort.
At present, cases are mainly classified with the help of intelligent auxiliary systems used by the courts. To identify the category of a case, court trial personnel manually screen the case and then fill in a case feature field. Because case categories span many fields and cover many kinds of illegal acts, categories are easily missed during classification, and accuracy is low.
Therefore, existing approaches suffer from low case classification accuracy.
Disclosure of Invention
The embodiment of the invention provides a case classification prediction method, a case classification prediction device, computer equipment and a storage medium, and aims to improve the accuracy of case classification prediction.
In order to solve the above technical problem, an embodiment of the present application provides a case classification prediction method, including the following steps.
Acquiring case data to be classified.
Extracting keywords from the case data to be classified based on a preset keyword extraction manner, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword.
Performing data enhancement processing on the sentences in the case data to be classified based on a contrastive learning manner to obtain at least one enhanced sentence.
Performing text vectorization on all the enhanced sentences to obtain a sentence vector corresponding to each enhanced sentence.
Splicing all the keyword vectors and all the sentence vectors in a preset order to obtain a spliced vector.
Performing context information extraction and prediction classification on the spliced vector to obtain a classification result corresponding to the case data to be classified.
In order to solve the above technical problem, an embodiment of the present application further provides a case classification prediction apparatus, including the following modules.
A to-be-classified case data acquisition module, used for acquiring the case data to be classified.
A keyword vector extraction module, used for extracting keywords from the case data to be classified based on a preset keyword extraction manner, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword.
An enhanced sentence acquisition module, used for performing data enhancement processing on the sentences in the case data to be classified based on a contrastive learning manner to obtain at least one enhanced sentence.
A sentence vector acquisition module, used for performing text vectorization on all the enhanced sentences to obtain a sentence vector corresponding to each enhanced sentence.
A splicing module, used for splicing all the keyword vectors and all the sentence vectors in a preset order to obtain a spliced vector.
A classification result acquisition module, used for performing context information extraction and prediction classification on the spliced vector to obtain a classification result corresponding to the case data to be classified.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the case classification prediction method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the case classification prediction method described above.
The case classification prediction method and device, the computer equipment and the storage medium provided by the embodiment of the invention acquire case data to be classified; extract keywords from the case data to be classified based on a preset keyword extraction manner, and perform text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword; perform data enhancement processing on the sentences in the case data to be classified based on a contrastive learning manner to obtain at least one enhanced sentence; perform text vectorization on all the enhanced sentences to obtain a sentence vector corresponding to each enhanced sentence; splice all the keyword vectors and all the sentence vectors in a preset order to obtain a spliced vector; and perform context information extraction and prediction classification on the spliced vector to obtain a classification result corresponding to the case data to be classified. By extracting keywords and vectorizing the data-enhanced sentences, the method draws on the deep-level information in both the keyword vectors and the sentence vectors, which improves the accuracy of case classification prediction.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied.
FIG. 2 is a flow chart of one embodiment of a case classification prediction method of the present application.
FIG. 3 is a schematic structural diagram of an embodiment of a case classification prediction apparatus according to the present application.
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
The case classification prediction method provided by the embodiment of the present application is executed by a server, and accordingly, a case classification prediction apparatus is provided in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
Referring to fig. 2, fig. 2 shows a case classification prediction method according to an embodiment of the present invention, which is described by taking the case classification prediction method applied to the server side in fig. 1 as an example, and is described in detail as follows.
S201, acquiring case data to be classified.
In step S201, the case data to be classified refers to case data with unknown case type.
It should be noted that the case data to be classified includes, but is not limited to, ecological and environmental protection cases and administrative litigation cases. The embodiment of the application takes ecological and environmental protection cases as an example.
When the case data to be classified is an ecological and environmental protection case, the case types comprise: the crime of polluting the environment; the crime of excessive felling of forest trees; the crime of dereliction of duty in environmental supervision; the crime of stealing forest trees; the crime of illegally occupying agricultural land; the crime of illegally fishing aquatic products; the crime of illegally purchasing or transporting stolen or excessively felled forest trees; the crime of illegally hunting or killing precious and endangered wild animals; the crime of illegally felling or destroying plants under key state protection; and the crime of illegal mining.
The ways of obtaining the case data to be classified include, but are not limited to, obtaining classic cases and obtaining documents from China Judgments Online.
When the case data to be classified is an ecological and environmental protection case, the classic-case route may obtain the case data from the CAIL2018 (Chinese AI and Law Challenge, the "Fa Yan Cup") data. The China Judgments Online route obtains the text data of written judgments.
By acquiring the case data to be classified, the type prediction of the case data to be classified is conveniently carried out subsequently, so that the accuracy of case classification prediction is improved, and the processing efficiency of case classification prediction is improved.
S202, extracting keywords from case data to be classified based on a preset keyword extraction mode, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword.
In step S202, the preset keyword extraction manner is a manner of extracting keywords from the case data to be classified.
The preset keyword extraction method includes, but is not limited to, a matching extraction method and a probability calculation method. The matching extraction method comprises the steps of establishing a keyword library of cases, carrying out word segmentation on case data to be classified to obtain word segmentation results, carrying out matching judgment on the word segmentation results and keywords in the keyword library, and if the matching judgment is passed, determining that the word segmentation results are the keywords. The probability calculation method is to extract fact description segments of cases based on rules, perform word segmentation on the fact description segments to obtain word segmentation results, perform probability calculation on the word segmentation results to obtain probability values corresponding to the word segmentation results, and take the word segmentation results with the probability values meeting preset conditions as keywords. It should be noted here that the preset keyword extraction manner is not limited to the above method, and the specific extraction manner is specifically set according to real needs.
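The matching extraction method described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the keyword library contents, the function name `match_keywords`, and the whitespace-split English tokens standing in for the word segmentation results are all assumptions made for the example.

```python
def match_keywords(tokens, keyword_library):
    """Return the tokens found in the keyword library, in first-occurrence order."""
    seen = set()
    matched = []
    for token in tokens:
        # A token passes the matching judgment if it is in the library.
        if token in keyword_library and token not in seen:
            seen.add(token)
            matched.append(token)
    return matched


# Hypothetical keyword library and pre-segmented case text (illustrative only).
KEYWORD_LIBRARY = {"wastewater", "discharge", "logging", "mining"}
tokens = (
    "the defendant arranged the discharge of untreated wastewater "
    "and the discharge continued"
).split()

keywords = match_keywords(tokens, KEYWORD_LIBRARY)
print(keywords)
```

A set is used for the library so that each matching judgment is a constant-time lookup; repeated hits are reported once.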
Text vectorization methods include, but are not limited to, the Doc2Vec method and the Word2Vec method. Doc2Vec builds its model with a simple single-layer neural network; a paragraph vector is added at the hidden layer and receives back-propagated gradients together with the word vectors, which yields the text vectorization. Word2Vec is a word-embedding method that, drawing on deep learning ideas, learns by predicting the words appearing in a context; the word vectors it trains measure the similarity between words well. All words are projected into a K-dimensional vector space so that each word is represented by one K-dimensional vector, and the processing of text content is thereby reduced to vector operations in that space.
Preferably, Word2Vec is adopted in the embodiment of the application to perform text vectorization to obtain a keyword vector.
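To illustrate how word vectors in a K-dimensional space can measure similarity between words, the sketch below computes cosine similarity between hand-written 3-dimensional vectors. The vectors and words are invented for the example and are not the output of a trained Word2Vec model; cosine similarity is one common choice of measure and is an assumption here, since the text does not name one.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical keyword vectors with K = 3 (illustrative values only).
word_vectors = {
    "logging": [0.9, 0.1, 0.0],
    "felling": [0.8, 0.2, 0.1],
    "fishing": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(word_vectors["logging"], word_vectors["felling"])
sim_far = cosine_similarity(word_vectors["logging"], word_vectors["fishing"])
print(sim_close > sim_far)
```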
The method comprises the steps of extracting keywords from case data to be classified, and performing text vectorization on each extracted keyword to obtain keyword vectors corresponding to the keywords, so that the keyword vectors in the case data to be classified can be extracted deeply, and the accuracy of case classification prediction can be improved.
S203, based on a contrastive learning manner, performing data enhancement processing on the sentences in the case data to be classified to obtain at least one enhanced sentence.
In step S203, data enhancement refers to expanding the data.
Data enhancement manners include, but are not limited to, supervised data enhancement and unsupervised data enhancement. Supervised data enhancement includes, but is not limited to, single-sample data enhancement and multi-sample data enhancement. Unsupervised data enhancement includes, but is not limited to, GAN (generative adversarial networks) and SimCSE (Simple Contrastive Learning of Sentence Embeddings, a sentence vector representation based on contrastive learning).
Preferably, the embodiment of the application adopts SimCSE for data enhancement. When the case data to be classified is an ecological and environmental protection case, the samples of such cases are imbalanced, so the case data to be classified is enhanced through contrastive learning using the unsupervised SimCSE method.
The sentence in the case data to be classified is subjected to data enhancement processing to obtain at least one enhanced sentence, so that the subsequent text vector extraction of the sentence after data enhancement is facilitated, the sentence vector and the keyword vector are combined to analyze and extract the deep information of the case data to be classified, and the accuracy of case classification prediction is improved.
S204, carrying out text vectorization on all the enhanced sentences to obtain sentence vectors corresponding to each enhanced sentence.
In step S204, the above-described text vectorization refers to a process of converting an enhanced sentence into a sentence vector.
The text vectorization is carried out on the enhanced sentences to obtain sentence vectors corresponding to each enhanced sentence, so that the sentence vectors and the keyword vectors are combined to analyze and extract deep information of case data to be classified, and the accuracy of case classification prediction is improved.
S205, splicing all the keyword vectors and all the sentence vectors according to a preset sequence to obtain spliced vectors.
In step S205, the preset sequence refers to a sequence in which the keyword vector and the sentence vector are concatenated. It should be noted that the preset sequence includes, but is not limited to (keyword vector, sentence vector), (sentence vector, keyword vector), (keyword vector 1, sentence vector 2, keyword vector 2, sentence vector 3, … …), wherein the sentences corresponding to sentence vector 1 and sentence vector 2 in (keyword vector 1, sentence vector 2, keyword vector 2, sentence vector 3, … …) include the keywords corresponding to keyword vector 1.
Preferably, the present application takes the order of (keyword vector, sentence vector).
It should be noted here that, because the number of the keyword vectors and the sentence vectors is not unique, the internal concatenation order of the keyword vectors is spliced according to the order of occurrence of the keyword vectors in the case data to be classified, and the internal concatenation order of the sentence vectors is spliced according to the priority of the sentence vectors. The specific splicing sequence can be adjusted according to actual conditions.
The splicing vector is obtained through the method, and the deep-level information of the case data to be classified is extracted through analyzing the splicing vector, so that the accuracy of case classification prediction is improved.
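One possible reading of the splicing step is a flat concatenation of the vectors in the preset order, sketched below. The function name `splice_vectors`, the `order` parameter, and the toy vector values are assumptions for illustration; the text does not specify whether the vectors are joined end to end or stacked as a sequence.

```python
def splice_vectors(keyword_vectors, sentence_vectors, order=("keyword", "sentence")):
    """Concatenate all vectors end to end, group by group, in the given order."""
    groups = {"keyword": keyword_vectors, "sentence": sentence_vectors}
    spliced = []
    for name in order:
        # Within a group, vectors keep the order in which they were supplied.
        for vector in groups[name]:
            spliced.extend(vector)
    return spliced


# Toy vectors (illustrative values only).
keyword_vectors = [[0.1, 0.2], [0.3, 0.4]]
sentence_vectors = [[1.0, 1.1, 1.2]]

spliced = splice_vectors(keyword_vectors, sentence_vectors)
reversed_order = splice_vectors(
    keyword_vectors, sentence_vectors, order=("sentence", "keyword")
)
print(spliced)
```

The default order corresponds to the preferred (keyword vector, sentence vector) arrangement; passing a different `order` gives the (sentence vector, keyword vector) arrangement.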
And S206, extracting context information and carrying out prediction classification on the spliced vectors to obtain a classification result corresponding to case data to be classified.
In step S206, the context information extraction is specifically a processing method of extracting the deep level information of the stitching vector.
The prediction classification refers to a process of predicting classification categories of case data to be classified.
Context information extraction and prediction classification are carried out on the spliced vectors, and deep-level information of case data to be classified is extracted, so that the accuracy of case classification prediction is improved.
In this embodiment, case data to be classified is acquired; keywords are extracted from the case data to be classified based on a preset keyword extraction manner, and text vectorization is performed on each extracted keyword to obtain a keyword vector corresponding to the keyword; data enhancement processing is performed on the sentences in the case data to be classified based on a contrastive learning manner to obtain at least one enhanced sentence; text vectorization is performed on all the enhanced sentences to obtain a sentence vector corresponding to each enhanced sentence; all the keyword vectors and all the sentence vectors are spliced in a preset order to obtain a spliced vector; and context information extraction and prediction classification are performed on the spliced vector to obtain a classification result corresponding to the case data to be classified. By extracting keywords and vectorizing the data-enhanced sentences, the method draws on the deep-level information in both the keyword vectors and the sentence vectors, which improves the accuracy of case classification prediction.
In some optional implementation manners of the present embodiment, in step S201, the step of acquiring case data to be classified includes steps S101 to S103.
S101, acquiring case documents.
S102, extracting the text of the case document based on a preset rule extraction mode to obtain a fact description text segment corresponding to the case document.
S103, preprocessing the fact description text segment to obtain case data to be classified.
In step S101, the case document refers to the written judgment corresponding to the case.
It should be noted here that the written judgments corresponding to a case include first-instance judgments and second-instance judgments. The case document needs to be preprocessed, for example by data deduplication, data cleaning, and stop-word removal.
Preferably, the embodiment of the present application employs a trial case decision.
The ways of obtaining case documents include, but are not limited to, data crawling and classic cases.
For example, a large amount of judgment text data is crawled from China Judgments Online using web crawler technology and combined with the CAIL2018 data to form the case documents.
The text template corresponding to the case document refers to the fixed format that the case document follows.
For example, the fixed format of the fact description passage in a written judgment is as follows: the opening marker is "It was ascertained at trial", and the closing markers include "the above accusations", "the above facts" and "this court holds". The fixed format of the crime-name information in the judgment text is marked by "defendant".
In step S102, the preset rule extraction manner extracts a specified passage from the text template. For example, when training a model, the fact description passage and the crime-name information are extracted. When predicting the case type, the extracted text is the fact description passage.
The preset rule extraction mode is realized by a re module in a python language environment.
When the extracted text is the fact description passage, the extraction rule is to extract the passage between the opening marker and the closing marker.
When the extracted text is the crime-name information, the extraction rule uses the markers "defendant" and "criminal".
It should be noted here that the specific extraction rule is adjusted according to the actual application scenario.
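Rule-based extraction with Python's re module, as described above, can be sketched like this. The function name `extract_fact_segment`, the sample document, and the English marker strings are placeholders chosen for the example; a real deployment would use the marker phrases of the actual written judgments and adjust the rule to the application scenario.

```python
import re


def extract_fact_segment(document, start_marker, end_markers):
    """Return the text between the opening marker and the first closing marker."""
    end_pattern = "|".join(re.escape(marker) for marker in end_markers)
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)(?:" + end_pattern + r")",
        re.DOTALL,  # let the passage span several lines
    )
    match = pattern.search(document)
    return match.group(1).strip() if match else None


# Placeholder markers and document (illustrative only).
START_MARKER = "It was ascertained at trial"
END_MARKERS = ["The above facts", "This court holds"]
document = (
    "Case header. It was ascertained at trial that the defendant discharged "
    "untreated wastewater into the river. The above facts are confirmed by "
    "witness testimony. This court holds that the conduct constitutes a crime."
)

fact_segment = extract_fact_segment(document, START_MARKER, END_MARKERS)
print(fact_segment)
```

The non-greedy `(.*?)` stops at the earliest closing marker, so the passage ends at whichever closing phrase appears first.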
Table 1 shows an example of the preset rule extraction manner; the rules used by the preset rule extraction are as follows.
Table 1. Example of the preset rule extraction manner
[Table 1 is an image in the original publication and is not reproduced here.]
In step S103, the preprocessing includes, but is not limited to, deduplication, data cleansing, and data labeling.
In the embodiment, the case document is subjected to text extraction in a preset rule extraction mode to obtain the fact description text segment corresponding to the case document, and the fact description text segment is preprocessed to obtain the case data to be classified, so that the case data to be classified are subjected to deep analysis subsequently, and the accuracy of case classification prediction is improved.
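The preprocessing of fact description passages (deduplication and data cleaning, plus the stop-word removal mentioned for case documents) might look like the sketch below. The stop-word list, the whitespace-based cleaning rule, and the function name `preprocess` are assumptions for illustration; data labeling is omitted because it depends on the annotation scheme.

```python
import re

# Hypothetical stop-word list (illustrative only).
STOP_WORDS = {"the", "a", "of", "and"}


def preprocess(segments):
    """Clean whitespace, drop empty and duplicate segments, remove stop words."""
    cleaned = []
    seen = set()
    for segment in segments:
        text = re.sub(r"\s+", " ", segment).strip()  # data cleaning
        if not text or text in seen:  # deduplication
            continue
        seen.add(text)
        tokens = [t for t in text.split(" ") if t.lower() not in STOP_WORDS]
        cleaned.append(" ".join(tokens))
    return cleaned


raw_segments = [
    "The  defendant felled trees ",
    "The defendant felled trees",
    "",
    "Mining of the sand",
]
processed = preprocess(raw_segments)
print(processed)
```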
In some optional implementation manners of this embodiment, in step S202, based on a preset keyword extraction manner, extracting keywords from the case data to be classified, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword includes steps S2021 to S2026.
S2021, performing word segmentation processing on case data to be classified based on a preset word segmentation mode to obtain at least one word segmentation result.
S2022, determining a corpus corresponding to case data to be classified according to all word segmentation results.
S2023, taking the words of the case data to be classified that appear in the corpus as candidate words, and adding the candidate words to a candidate word set.
S2024, based on the TF-IDF algorithm, calculating a keyword probability for each candidate word in the candidate word set.
S2025, selecting, in descending order of keyword probability, a preset number of candidate words as the keywords.
S2026, aiming at each keyword, performing text vectorization on the keyword to obtain a keyword vector corresponding to the keyword.
In step S2021, the preset word segmentation mode refers to a mode of segmenting words of case data to be classified.
The preset word segmentation mode includes, but is not limited to, a word segmentation method based on a word list, a word segmentation method based on a statistical model, and a word segmentation method based on sequence tagging.
Preferably, the jieba module of Python is used in the embodiment of the present application.
In step S2022, it is specifically: and summarizing and de-duplicating all word segmentation results, and determining a corpus corresponding to case data to be classified.
In step S2024, specifically, a TF value and an IDF value are calculated for each candidate word in the candidate word set based on the TF-IDF algorithm. Based on the TF value and the IDF value of each candidate word, normalization is performed to obtain a normalized value corresponding to the candidate word, and the normalized value is taken as the keyword probability of the candidate word.
It should be noted that the larger the normalized value of a candidate word, the more likely the candidate word is to be a keyword.
Preferably, the embodiment of the present application takes the 10 candidate words with the highest normalized values in the case data to be classified as the keywords.
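Steps S2024 and S2025 can be illustrated with a minimal sketch. The source does not give the exact TF, IDF, or normalization formulas, so the ones below are assumptions: term frequency relative to document length, a smoothed logarithmic IDF computed over a hypothetical reference collection `docs`, and normalization by dividing each score by the sum of all scores. The function names are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def keyword_probabilities(target: List[str], docs: List[List[str]]) -> Dict[str, float]:
    """Return a normalized TF-IDF score (summing to 1) for each word in `target`."""
    if not target:
        return {}
    counts = Counter(target)
    n_docs = len(docs)
    scores: Dict[str, float] = {}
    for word, count in counts.items():
        tf = count / len(target)
        df = sum(1 for doc in docs if word in doc)
        # Smoothed IDF (assumption): always >= 1, so no score is zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = tf * idf
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}


def top_keywords(probs: Dict[str, float], k: int = 10) -> List[str]:
    """Pick the k words with the largest probability; ties are broken alphabetically."""
    ranked = sorted(probs.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical reference collection of segmented case descriptions.
docs = [
    ["theft", "phone", "defendant"],
    ["loan", "contract", "defendant"],
    ["injury", "defendant"],
]
probs = keyword_probabilities(docs[0], docs)
keywords = top_keywords(probs, k=2)
```

In this toy collection, "defendant" occurs in every document and therefore receives a lower probability than "theft" or "phone", which is the behavior that makes TF-IDF useful for picking case-specific keywords.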
In step S2026, specifically, for each keyword, text vectorization is performed on the keyword based on Word2Vec to obtain the keyword vector corresponding to the keyword.
In this embodiment, word segmentation is performed on the case data to be classified, the corpus is determined, the keyword probabilities of the candidate words are calculated, the keywords are selected, and the corresponding keyword vectors are obtained, so that deeper information can subsequently be extracted from the case data to be classified in combination with the keyword vectors, which improves the accuracy of case classification prediction.
In some optional implementations of this embodiment, step S203, namely performing data enhancement processing on the sentences in the case data to be classified based on the contrastive learning manner to obtain at least one enhanced sentence, includes steps S301 to S302.
S301, constructing positive sample pairs for the sentences in the case data to be classified based on a preset positive sample construction mode to obtain a preset number of positive sample pairs.
S302, sequentially performing data enhancement on all positive sample pairs by contrastive learning to obtain an enhanced sentence corresponding to each positive sample.
In step S301, the preset positive sample construction mode constructs positive sample pairs based on Dropout, as in the SimCSE method: the same sentence is encoded several times under different Dropout masks, and the resulting text vectors are taken as positive sample pairs.
The preset number refers to the number of positive sample pairs corresponding to case data to be classified.
The preset number may be obtained in ways that include, but are not limited to, using an empirical value or setting it manually.
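The Dropout-based construction of positive sample pairs can be sketched with a toy example. It rests on strong assumptions: in SimCSE the Dropout masks act inside the sentence encoder during two forward passes, whereas here, purely for illustration, two independent masks are applied directly to a fixed embedding. The names `dropout` and `build_positive_pairs`, the rate `p`, and the `seed` parameter are all hypothetical.

```python
import random
from typing import List, Tuple


def dropout(vec: List[float], p: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each component with probability p, scale the rest by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in vec]


def build_positive_pairs(
    embedding: List[float], num_pairs: int, p: float = 0.1, seed: int = 0
) -> List[Tuple[List[float], List[float]]]:
    """Create `num_pairs` positive pairs, each made of two independently masked views."""
    rng = random.Random(seed)
    return [
        (dropout(embedding, p, rng), dropout(embedding, p, rng))
        for _ in range(num_pairs)
    ]


# Hypothetical 64-dimensional sentence embedding and a preset number of 3 pairs.
pairs = build_positive_pairs([1.0] * 64, num_pairs=3, p=0.5)
```

Both views of a pair come from the same sentence, so they share its meaning while differing in which components were dropped; that difference is the only "augmentation" this construction uses.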
In step S302, the contrastive learning manner refers to obtaining the sentence vectors of the fact description segment through SimCSE-based contrastive learning.
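For reference, the unsupervised training objective published with SimCSE is reproduced below; the source does not state which loss it uses, so treating this as the objective of the embodiment is an assumption. Here $h_i^{z_i}$ and $h_i^{z_i'}$ are the two encodings of sentence $i$ under different Dropout masks, $\mathrm{sim}$ is cosine similarity, $\tau$ is a temperature hyperparameter, and $N$ is the batch size.

```latex
\ell_i = -\log
  \frac{\exp\!\left(\mathrm{sim}\!\left(h_i^{z_i},\, h_i^{z_i'}\right)/\tau\right)}
       {\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}\!\left(h_i^{z_i},\, h_j^{z_j'}\right)/\tau\right)}
```

Minimizing $\ell_i$ pulls the two views of the same sentence together and pushes them away from the other sentences in the batch.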
In this embodiment, the enhanced sentences are obtained by contrastive learning, which helps to obtain text vectors containing rich semantic information and thereby improves the accuracy of case classification prediction.
In some optional implementation manners of this embodiment, in step S206, the step of performing context information extraction and prediction classification on the spliced vector to obtain a classification result corresponding to the case data to be classified includes steps S601 to S603.
S601, extracting context information of the spliced vector based on a trained classifier to obtain a context information vector, wherein the classifier comprises a full connection layer, a Dropout layer and a Softmax layer.
S602, performing full connection on the context information vector based on the fully connected layer and the Dropout layer to obtain a full-connection vector.
S603, classifying the full-connection vector based on the Softmax layer to obtain a classification result corresponding to the case data to be classified.
In step S601, preferably, the trained classifier is a BiLSTM classifier.
In step S602, the fully connected layer performs feature dimensionality reduction in preparation for obtaining the classification result corresponding to the case data to be classified, and the Dropout layer is used to prevent overfitting during model training.
In step S603, the Softmax layer outputs the class probabilities of the case data to be classified.
The classification result refers to the case type corresponding to the case data to be classified.
Specifically, the full-connection vector is classified based on the Softmax layer to obtain the class probabilities of the case data to be classified, and the classification result of the case data to be classified is determined according to these class probabilities.
In this embodiment, the trained BiLSTM classifier, together with its fully connected layer, Dropout layer and Softmax layer, performs context information extraction, full connection and classification on the spliced vector to determine the classification result of the case data to be classified.
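The fully connected and Softmax stages of steps S602 and S603 can be sketched at inference time as follows. This is a simplified illustration, not the embodiment's model: the BiLSTM that produces the context information vector is omitted, the weights, bias, and labels are made-up values, and Dropout is treated as the identity because it is only active during training.

```python
import math
from typing import List, Sequence, Tuple


def linear(x: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax that turns logits into class probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    context_vec: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Return the most probable case type and the full probability list."""
    # Dropout is skipped here: at inference time it acts as the identity.
    probs = softmax(linear(context_vec, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical two-class example with hand-picked parameters.
label, probs = classify(
    context_vec=[1.0, 0.0],
    weights=[[2.0, 0.0], [0.0, 2.0]],
    bias=[0.0, 0.0],
    labels=["theft", "fraud"],
)
```

The classification result is simply the label with the highest class probability, which matches the description of determining the case type from the Softmax output.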
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 3 is a schematic block diagram of a case classification prediction device corresponding to the case classification prediction method of the above embodiment. As shown in fig. 3, the case classification prediction device includes a to-be-classified case data acquisition module 31, a keyword vector extraction module 32, an enhanced sentence acquisition module 33, a sentence vector acquisition module 34, a splicing module 35 and a classification result acquisition module 36. Each functional module is described in detail below.
The to-be-classified case data acquisition module 31 is configured to acquire the case data to be classified.
The keyword vector extraction module 32 is configured to extract keywords from the case data to be classified based on a preset keyword extraction mode, and to perform text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword.
The enhanced sentence acquisition module 33 is configured to perform data enhancement processing on the sentences in the case data to be classified based on the contrastive learning manner to obtain at least one enhanced sentence.
The sentence vector acquisition module 34 is configured to perform text vectorization on all the enhanced sentences to obtain a sentence vector corresponding to each enhanced sentence.
The splicing module 35 is configured to splice all the keyword vectors and all the sentence vectors according to a preset order to obtain a spliced vector.
The classification result acquisition module 36 is configured to perform context information extraction and predictive classification on the spliced vector to obtain a classification result corresponding to the case data to be classified.
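The work of the splicing module can be sketched as a plain concatenation. The source does not say what the preset order is, so placing keyword vectors before sentence vectors is an assumption, as is flattening everything into one vector; an implementation that feeds a BiLSTM might instead stack the vectors into a sequence. The function name `splice` is hypothetical.

```python
from typing import List, Sequence


def splice(
    keyword_vecs: Sequence[Sequence[float]],
    sentence_vecs: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate all keyword vectors, then all sentence vectors, in input order."""
    spliced: List[float] = []
    # Assumed preset order: keywords first, sentences second.
    for vec in list(keyword_vecs) + list(sentence_vecs):
        spliced.extend(vec)
    return spliced


# Hypothetical 2-dimensional vectors: two keywords and one enhanced sentence.
spliced = splice([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]])
```

Whatever order is chosen, it has to be the same at training and prediction time, since the classifier learns which positions carry keyword information and which carry sentence information.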
Optionally, the to-be-classified case data acquisition module 31 further includes:
A case document acquisition unit, configured to acquire a case document.
A fact description segment acquisition unit, configured to perform text extraction on the case document based on a preset rule extraction mode to obtain the fact description segment corresponding to the case document.
A preprocessing unit, configured to preprocess the fact description segment to obtain the case data to be classified.
Optionally, the keyword vector extraction module 32 further includes:
A word segmentation unit, configured to perform word segmentation processing on the case data to be classified based on a preset word segmentation mode to obtain at least one word segmentation result.
A corpus determining unit, configured to determine a corpus corresponding to the case data to be classified according to all word segmentation results.
A candidate word set acquisition unit, configured to take each word of the case data to be classified that appears in the corpus as a candidate word and add the candidate word to a candidate word set.
A keyword probability acquisition unit, configured to perform keyword probability calculation on each candidate word in the candidate word set based on the TF-IDF algorithm to obtain the keyword probability corresponding to each candidate word.
A keyword acquisition unit, configured to perform, for each keyword, text vectorization on the keyword to obtain a keyword vector corresponding to the keyword.
Optionally, the enhanced sentence acquisition module 33 further includes:
A positive sample pair acquisition unit, configured to construct positive sample pairs for the sentences in the case data to be classified based on a preset positive sample construction mode to obtain a preset number of positive sample pairs.
An enhanced sentence acquisition unit, configured to sequentially perform data enhancement on all the positive sample pairs by contrastive learning to obtain an enhanced sentence corresponding to each positive sample.
Optionally, the classification result acquisition module 36 further includes:
A context information vector acquisition unit, configured to extract context information from the spliced vector based on the trained classifier to obtain a context information vector, wherein the classifier comprises a fully connected layer, a Dropout layer and a Softmax layer.
A full-connection vector acquisition unit, configured to perform full connection on the context information vector based on the fully connected layer and the Dropout layer to obtain a full-connection vector.
A classification unit, configured to classify the full-connection vector based on the Softmax layer to obtain a classification result corresponding to the case data to be classified.
For the specific definition of the case classification prediction device, reference may be made to the above definition of the case classification prediction method, which is not repeated here. The modules in the case classification prediction device may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 that are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the memory 41, the processor 42 and the network interface 43 is shown, but it should be understood that not all of the shown components are required, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as program codes for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing control of an electronic file.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing an interface display program, which is executable by at least one processor to cause the at least one processor to execute the steps of the case classification prediction method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are only some, and not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. This application can be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A case classification prediction method is characterized by comprising the following steps:
acquiring case data to be classified;
extracting keywords from the case data to be classified based on a preset keyword extraction mode, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword;
based on a contrastive learning mode, performing data enhancement processing on sentences in the case data to be classified to obtain at least one enhanced sentence;
performing text vectorization on all the enhanced sentences to obtain sentence vectors corresponding to each enhanced sentence;
splicing all the keyword vectors and all the sentence vectors according to a preset sequence to obtain spliced vectors;
and extracting context information and performing prediction classification on the spliced vector to obtain a classification result corresponding to the case data to be classified.
2. The case classification prediction method according to claim 1, characterized in that said step of obtaining case data to be classified comprises:
acquiring a case document;
extracting texts of the case documents based on a preset rule extraction mode to obtain fact description segments corresponding to the case documents;
and preprocessing the fact description segment to obtain the case data to be classified.
3. The case classification prediction method as claimed in claim 1, wherein the step of extracting keywords from the case data to be classified based on a preset keyword extraction manner, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword comprises:
performing word segmentation processing on the case data to be classified based on a preset word segmentation mode to obtain at least one word segmentation result;
determining a corpus corresponding to the case data to be classified according to all the word segmentation results;
taking each word of the case data to be classified that appears in the corpus as a candidate word, and adding the candidate word to a candidate word set;
performing keyword probability calculation on each candidate word in the candidate word set based on a TF-IDF algorithm to obtain the keyword probability corresponding to each candidate word;
sorting the candidate words by keyword probability in descending order, and selecting a preset number of the top-ranked candidate words as the keywords;
and for each keyword, performing text vectorization on the keyword to obtain a keyword vector corresponding to the keyword.
4. The case classification prediction method according to claim 1, characterized in that the step of performing data enhancement processing on the sentences in the case data to be classified based on a contrastive learning mode to obtain at least one enhanced sentence comprises:
constructing positive sample pairs for the sentences in the case data to be classified based on a preset positive sample construction mode to obtain a preset number of positive sample pairs;
and sequentially performing data enhancement on all the positive sample pairs by contrastive learning to obtain an enhanced sentence corresponding to each positive sample.
5. The case classification prediction method according to claim 1, wherein the step of performing context information extraction and prediction classification on the spliced vector to obtain the classification result corresponding to the case data to be classified comprises:
extracting context information of the spliced vector based on a trained classifier to obtain a context information vector, wherein the classifier comprises a full connection layer, a Dropout layer and a Softmax layer;
fully connecting the context information vectors based on the fully-connected layer and the Dropout layer to obtain fully-connected vectors;
and classifying the full-connection vectors based on the Softmax layer to obtain a classification result corresponding to the case data to be classified.
6. A case classification prediction apparatus, characterized by comprising:
the case data to be classified acquiring module is used for acquiring the case data to be classified;
the keyword vector extraction module is used for extracting keywords from the case data to be classified based on a preset keyword extraction mode, and performing text vectorization on each extracted keyword to obtain a keyword vector corresponding to the keyword;
the enhanced sentence acquisition module is used for carrying out data enhancement processing on sentences in the case data to be classified based on a contrastive learning mode to obtain at least one enhanced sentence;
a sentence vector acquisition module, configured to perform text vectorization on all the enhanced sentences to obtain a sentence vector corresponding to each enhanced sentence;
the splicing module is used for splicing all the keyword vectors and all the sentence vectors according to a preset sequence to obtain spliced vectors;
and the classification result acquisition module is used for extracting context information and predicting and classifying the splicing vector to obtain a classification result corresponding to the case data to be classified.
7. The case classification prediction device according to claim 6, characterized in that said case data to be classified acquisition module comprises:
the case document acquisition unit is used for acquiring case documents;
the fact description segment acquiring unit is used for extracting the text of the case document based on a preset rule extracting mode to obtain a fact description segment corresponding to the case document;
and the preprocessing unit is used for preprocessing the fact description segment to obtain the case data to be classified.
8. The case classification prediction apparatus according to claim 6, wherein the keyword vector extraction module comprises:
the word segmentation unit is used for carrying out word segmentation processing on the case data to be classified based on a preset word segmentation mode to obtain at least one word segmentation result;
a corpus determining unit, configured to determine a corpus corresponding to the case data to be classified according to all the word segmentation results;
a candidate word set acquisition unit, configured to take each word of the case data to be classified that appears in the corpus as a candidate word, and add the candidate word to a candidate word set;
a keyword probability obtaining unit, configured to perform keyword probability calculation on each candidate word in the candidate word set based on a TF-IDF algorithm to obtain a keyword probability corresponding to each candidate word;
and the keyword acquisition unit is used for performing text vectorization on the keywords aiming at each keyword to obtain a keyword vector corresponding to the keyword.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the case classification prediction method according to any of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a case classification prediction method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210437363.6A CN114547315A (en) 2022-04-25 2022-04-25 Case classification prediction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114547315A true CN114547315A (en) 2022-05-27

Family

ID=81666850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210437363.6A Pending CN114547315A (en) 2022-04-25 2022-04-25 Case classification prediction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114547315A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033155A (en) * 2018-06-13 2018-12-18 中国电子科技集团公司电子科学研究院 Search mail content and method, device, terminal and storage medium
CN110162787A (en) * 2019-05-05 2019-08-23 西安交通大学 A kind of class prediction method and device based on subject information
CN110990559A (en) * 2018-09-29 2020-04-10 北京国双科技有限公司 Method and apparatus for classifying text, storage medium, and processor
CN111177367A (en) * 2019-11-11 2020-05-19 腾讯科技(深圳)有限公司 Case classification method, classification model training method and related products
CN112989822A (en) * 2021-04-16 2021-06-18 北京世纪好未来教育科技有限公司 Method, device, electronic equipment and storage medium for recognizing sentence categories in conversation
CN113343706A (en) * 2021-05-27 2021-09-03 山东师范大学 Text depression tendency detection system based on multi-modal features and semantic rules
CN113361252A (en) * 2021-05-27 2021-09-07 山东师范大学 Text depression tendency detection system based on multi-modal features and emotion dictionary
CN114385808A (en) * 2020-10-16 2022-04-22 顺丰科技有限公司 Text classification model construction method and text classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵宏田: "《用户画像》", 30 October 2019 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108185A (en) * 2023-03-09 2023-05-12 中关村科学城城市大脑股份有限公司 Attention-enhancing pre-training method and device for text classification
CN115935245A (en) * 2023-03-10 2023-04-07 吉奥时空信息技术股份有限公司 Automatic classification and distribution method for government affair hotline cases
CN116521824A (en) * 2023-04-18 2023-08-01 北京数美时代科技有限公司 Method, system and electronic equipment for enhancing sample by using keywords
CN117114148A (en) * 2023-08-18 2023-11-24 湖南工商大学 Lightweight federal learning training method
CN117114148B (en) * 2023-08-18 2024-04-09 湖南工商大学 Lightweight federal learning training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220527
