CN113032544B - Case automatic processing method and device based on big data and terminal equipment - Google Patents

Case automatic processing method and device based on big data and terminal equipment Download PDF

Info

Publication number
CN113032544B
CN113032544B (application CN202110542723.4A)
Authority
CN
China
Prior art keywords
case
processing method
model
historical
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110542723.4A
Other languages
Chinese (zh)
Other versions
CN113032544A (en)
Inventor
周金明
陈贵龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Inspector Intelligent Technology Co Ltd
Original Assignee
Nanjing Inspector Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Inspector Intelligent Technology Co Ltd filed Critical Nanjing Inspector Intelligent Technology Co Ltd
Priority to CN202110542723.4A priority Critical patent/CN113032544B/en
Publication of CN113032544A publication Critical patent/CN113032544A/en
Application granted granted Critical
Publication of CN113032544B publication Critical patent/CN113032544B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3322Query formulation using system suggestions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big-data-based automatic case processing method, device, and terminal equipment. The method comprises: step 1, acquiring all processed historical cases as the historical cases to be matched, and computing a central-idea vector for each case; step 2, performing coarse-ranking matching of a new case against the processed historical cases; and step 3, after the coarse-ranking result is obtained, computing fine-ranking similarity with a text similarity matching algorithm and intelligently matching a processing result for the new case. By coarse-ranking the new case and then computing fine-ranking similarity with a text similarity matching algorithm, the processing result of the new case is obtained automatically, which greatly improves case-processing efficiency and saves substantial manpower and material resources.

Description

Case automatic processing method and device based on big data and terminal equipment
Technical Field
The invention relates to the field of big data case processing and natural language processing research, in particular to a case automatic processing method and device based on big data and terminal equipment.
Background
Most current case processing is traditional manual processing: problems are solved by hand. However, because China's population base is large, social problems are complex, the total number of cases is high, and the fields involved are varied, staff must provide solutions based on their own knowledge, professional accumulation, and work experience, which is time-consuming and labor-intensive. A staff member must manually judge the approximate type of a case from its text and determine a corresponding solution strategy; an intelligent method for processing cases automatically is lacking.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a big-data-based automatic case processing method, device, and terminal equipment, in which the processing result of a new case is obtained intelligently by coarse-ranking matching of the new case and then computing fine-ranking similarity with a text similarity matching algorithm, greatly improving case-processing efficiency and saving substantial manpower and material resources. The technical scheme is as follows:
in a first aspect, a big-data-based automatic case processing method is provided, comprising the following steps:
step 1, acquiring all processed historical cases as the historical cases to be matched, wherein each historical case comprises its case description and its processing result; extracting several keywords for each case from its case description and processing result, computing a word vector for each keyword with a Chinese BERT model, and averaging the keyword word vectors to obtain the central-idea vector of the case.
Step 2, performing coarse-ranking matching of the new case against the processed historical cases;
for a new case, several keywords are first selected from its case description, and their synonyms are added to form a search-term set W = {w_1, w_2, …, w_n}, where n is the number of search terms. The word vector of each search term is computed with the Chinese BERT model. Each search-term word vector and each central-idea vector is then normalized, i.e., divided by its norm, so that every normalized vector has length 1. Denote by A_i the normalized word vector of the new case's search term w_i, and by B the normalized central-idea vector of a given historical case.
Calculate the coarse-ranking similarity between the new case and each historical case: the coarse-ranking similarity is the average inner product of the new case's normalized search-term word vectors with the historical case's normalized central-idea vector, i.e., the coarse-ranking similarity C is:
C = (1/n) · Σ_{i=1}^{n} (A_i · B)
and acquire the historical cases whose coarse-ranking similarity exceeds a given threshold, then select the top N of them by coarse-ranking similarity as the coarse-ranking result.
Step 3, after the coarse-ranking result is obtained, compute fine-ranking similarity with a text similarity matching algorithm and intelligently match a processing result for the new case;
construct a case description-case description matching-degree model and a case description-processing method matching-degree model, and train both; the two models share the same structure, a BERT encoder followed by a binary-classification head.
Training a case description-case description matching degree model:
for any two historical cases, if their case descriptions describe the same fact, the pair is labeled matched; otherwise it is labeled unmatched. Training samples are obtained in this way.
The training process is as follows: take the two historical cases as text 1 and text 2, convert each word of the two texts into a word vector, and input them into a BERT model. The vector output at the first position ([CLS]) of the last BERT layer is fed into a linear binary classifier to obtain a matching score in the range 0-1. The pair is considered matched when the matching score is ≥ α, with α ∈ [0.5, 0.6]; otherwise it is considered unmatched. Training the parameters on the training samples yields the case description-case description matching-degree model, Model1.
Training a case description-processing method matching degree model:
for the historical cases, the case description and processing method of the same case are taken as a matched pair; otherwise the pair is considered unmatched. Training samples are obtained in this way.
The training process is as follows: take a case description and a processing method as text 1 and text 2, convert each word of the two texts into a word vector, and input them into a BERT model. The vector output at the first position ([CLS]) of the last BERT layer is fed into a linear binary classifier to obtain a matching score in the range 0-1. The pair is considered matched when the matching score is ≥ β, with β ∈ [0.6, 0.7]; otherwise it is considered unmatched. Training the parameters on the training samples yields the case description-processing method matching-degree model, Model2.
After Model1 and Model2 are trained, for a new case, the matching degree with each historical case in the coarse-ranking result is computed in turn: for a historical case H, concatenate the new case's description with H's description and input the pair into Model1 to obtain matching score S1, and concatenate the new case's description with H's processing method and input the pair into Model2 to obtain matching score S2. The fine-ranking similarity S between H and the new case is:
S = X1 · S1 + X2 · S2
where X1 and X2 are the weights of matching scores S1 and S2, respectively. Compute the fine-ranking similarity between the new case and each historical case in the coarse-ranking result in turn, select the historical case with the largest fine-ranking similarity, and take that case's processing method as the processing result of the new case.
Preferably, the method further comprises: when the processed historical cases are acquired in step 1 or a new case is acquired in step 2, if the case is in text form, the text is taken directly as the case description; if the case is in PDF or picture form, the text is first obtained through image recognition and then taken as the case description.
Preferably, during the acquisition of training samples for Model2, the method further includes: if the processing method of another case is also applicable to the present case, the present case's description and that processing method are also labeled matched.
Preferably, the method further comprises: during training of the two similarity models, augmented samples are added, namely: after each word in the text is converted into a word vector, a subset of the word vectors is chosen at random, one or more randomly selected dimensions of each chosen vector have a tiny value added or subtracted, and the perturbed vectors are then input into the similarity model for training.
Preferably, the method further comprises: replacing "select the historical case with the largest fine-ranking similarity and take its processing method as the processing result of the new case" with: select several historical cases with high fine-ranking similarity, and synthesize their processing methods to obtain the processing result of the new case.
Preferably, selecting the historical case with the largest fine-ranking similarity and taking its processing method as the processing result of the new case specifically comprises: if the largest fine-ranking similarity is greater than a threshold, directly select that historical case's processing method as the result; otherwise, generate a processing strategy with a model.
Further, generating a processing strategy with a model specifically comprises: construct a seq2seq model in which a BERT model is chosen as the encoder module and a BERT model is chosen as the decoder module. The input is a case's description and the output is the corresponding processing method; training the model on historical cases and their processing methods yields Model3, and inputting the new case's description into Model3 produces a generated processing method.
Further, the reasonableness of the processing method generated by Model3 is judged: the matching degree between the new case's description and the generated processing method is computed with the matching-degree model Model2; if the matching degree is greater than a set threshold, the generated method is considered suitable and can be used directly; otherwise it is adjusted manually.
Compared with the prior art, the technical scheme has the following beneficial effects: by coarse-ranking the new case and then computing fine-ranking similarity with a text similarity matching algorithm, the processing result of the new case is obtained intelligently, and an intelligent automatic processing strategy is provided that staff can adopt directly as a reference, greatly improving case-processing efficiency and saving substantial manpower and material resources.
Drawings
Fig. 1 is a diagram of a matching degree model structure according to an embodiment of the present disclosure.
Detailed Description
To clarify the technical solution and working principle of the invention, embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings. All of the optional technical solutions above may be combined arbitrarily to form optional embodiments of the present disclosure, which are not repeated here.
The terms "step 1," "step 2," "step 3," and the like in the description and claims of this application and the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the application described herein may, for example, be implemented in an order other than those described herein.
In a first aspect: the embodiment of the disclosure provides a case automatic processing method based on big data, which comprises the following steps:
Fig. 1 shows the structure of the matching-degree model provided in an embodiment of the present disclosure. With reference to the figure, the method mainly includes the following steps:
step 1, acquire all processed historical cases as the historical cases to be matched, wherein each historical case comprises its case description and its processing result; extract several keywords (e.g., 3 keywords) for each case from its case description and processing result, compute a word vector for each keyword with a Chinese BERT model, and average the keyword word vectors to obtain the central-idea vector of the case.
In practice, acquired cases are often not in text format, e.g., they arrive in PDF or picture form; obtaining the text of such attachments through image recognition achieves fast processing.
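Step 1 above can be sketched minimally as follows; this is a sketch under the assumption that keywords have already been extracted, and random stand-in vectors are used in place of the Chinese BERT keyword embeddings:

```python
import numpy as np

def central_idea_vector(keyword_vectors):
    # Average the keyword word vectors to obtain the case's central-idea vector.
    return np.mean(np.stack(keyword_vectors), axis=0)

# Stand-in for Chinese-BERT word vectors of e.g. 3 extracted keywords
# (BERT-base word vectors are 768-dimensional).
rng = np.random.default_rng(0)
kw_vecs = [rng.normal(size=768) for _ in range(3)]
center = central_idea_vector(kw_vecs)
```

In practice each keyword vector would come from the Chinese BERT model rather than a random generator; the averaging step is unchanged.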
Step 2, perform coarse-ranking matching of the new case against the processed historical cases.
For a new case, firstly, a plurality of keywords are selected from the case description, synonyms similar to the plurality of keywords are added to construct a search term set W { W }1,w2,……,wnAnd (6) calculating to obtain a word vector of each search word through a Chinese BERT model.
Historical cases that are substantially the same as the new case are obtained first. For each historical case, the coarse-ranking similarity with the new case is computed: first normalize the search-term word vectors and the central-idea vector, i.e., divide each vector by its norm so that the normalized vector has length 1. Denote by A_i the normalized word vector of the new case's search term w_i, and by B the normalized central-idea vector of the historical case. The coarse-ranking similarity C is the average inner product of the new case's normalized search-term word vectors with the historical case's normalized central-idea vector, namely
C = (1/n) · Σ_{i=1}^{n} (A_i · B)
Normalization ensures comparability. Note that the coarse-ranking similarity involves only simple vector inner products, with no complicated natural-language processing or model inference, so the computation parallelizes easily. Acquire the historical cases whose coarse-ranking similarity exceeds a given threshold, and select the N highest-scoring ones, from high to low, as the coarse-ranking result.
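Because only inner products are involved, the whole coarse-ranking step collapses into one matrix product. A sketch under illustrative choices of threshold and N (the vectors would in practice be Chinese BERT embeddings):

```python
import numpy as np

def coarse_rank(search_vecs, center_vecs, threshold=0.3, top_n=10):
    """Coarse ranking: normalize, average inner product, threshold, top-N.

    search_vecs: (n, d) word vectors of the new case's n search terms.
    center_vecs: (m, d) central-idea vectors of the m historical cases.
    Returns (indices of qualifying cases sorted high-to-low, all similarities).
    """
    A = search_vecs / np.linalg.norm(search_vecs, axis=1, keepdims=True)
    B = center_vecs / np.linalg.norm(center_vecs, axis=1, keepdims=True)
    # C_j = (1/n) * sum_i A_i . B_j  -- one matrix product covers all cases.
    sims = (A @ B.T).mean(axis=0)          # (m,) coarse similarity per case
    idx = np.argsort(-sims)                # high to low
    idx = idx[sims[idx] > threshold]       # keep only those above the threshold
    return idx[:top_n], sims
```

The `threshold=0.3` and `top_n=10` defaults are assumptions for illustration; the patent leaves both values open.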
Preferably, step 2 further includes: when a new case is acquired, if it is input directly as text, the text is taken as the case description; if it is in PDF or picture form, the text is first obtained through image recognition.
Step 3, after the coarse-ranking result is obtained, compute fine-ranking similarity with a more accurate text similarity matching algorithm and intelligently match a processing result for the new case;
two models were trained: the case description-case description matching degree model and the case description-processing method matching degree model. The two matching degree models have the same structure, and as shown in fig. 1, the matching degree model has a structure of BERT + two-class framework.
(1) Training case description-case description matching degree model:
for any two historical cases, if their case descriptions describe the same fact, the pair is labeled matched; otherwise it is labeled unmatched. Training samples are obtained in this way. The training process of the matching-degree model is: take the two historical cases as text 1 and text 2, convert each word of the two texts into a word vector, and input them into a BERT model. The vector output at the first position ([CLS]) of the last BERT layer is fed into a linear binary classifier to obtain a matching score in the range 0-1; the pair is considered matched when the score is ≥ 0.5, otherwise unmatched. Training the parameters on the training samples yields the case description-case description matching-degree model, Model1.
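A sketch of the scoring head only: the BERT encoder itself and the learning of the head's parameters are assumed, so a random stand-in is used for the [CLS] output. The sigmoid of a linear map gives the 0-1 matching score described above:

```python
import numpy as np

def match_score(cls_vec, w, b):
    # Linear binary head on the [CLS] vector; sigmoid yields a score in (0, 1).
    return 1.0 / (1.0 + np.exp(-(cls_vec @ w + b)))

def is_match(score, alpha=0.5):
    # Model1 declares a match when the score >= alpha, with alpha in [0.5, 0.6].
    return score >= alpha

rng = np.random.default_rng(1)
cls_vec = rng.normal(size=768)    # stand-in for BERT's last-layer [CLS] output
w = rng.normal(size=768) * 0.01   # head weights (learned in practice)
b = 0.0
s = match_score(cls_vec, w, b)
```

In a real implementation `cls_vec` would come from a fine-tuned Chinese BERT fed the concatenated text pair, and `w`, `b` would be trained jointly with it.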
(2) Training the case description-processing method matching-degree model, Model2:
for the historical cases, the case description and processing method of the same case are taken as a matched pair; otherwise the pair is considered unmatched. Training samples are obtained in this way. Preferably, the sample-acquisition process for Model2 further includes: if the processing method of another case is also applicable to the present case, the present case's description and that method are also labeled matched. The training process is: take a case description and a processing method as text 1 and text 2, convert each word of the two texts into a word vector, and input them into a BERT model. The vector output at the first position ([CLS]) of the last BERT layer is fed into a linear binary classifier to obtain a matching score in the range 0-1; the pair is considered matched when the score is ≥ 0.6, otherwise unmatched. Training the parameters on the training samples yields the case description-processing method matching-degree model, Model2. A higher matching-score threshold is set for Model2 because the match to a processing method must be stricter, making the matched method more accurate and feasible.
Preferably, during training of the two similarity models, augmented samples are added, namely: after each word in the text is converted into a word vector, a subset of the word vectors is chosen at random, and each chosen vector has a tiny value (e.g., 0.0000000001) added to or subtracted from a randomly selected dimension before the vectors are input into the similarity model for training. Adding small perturbations to the word vectors raises the training difficulty, so that the model still learns the central idea of a text even in the presence of a few typos or synonym substitutions, improving the model's robustness.
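The augmentation above can be sketched as follows; `frac`, the fraction of word vectors to perturb, is an illustrative choice not fixed by the text:

```python
import numpy as np

def augment(word_vecs, frac=0.3, eps=1e-10, rng=None):
    """Perturb a random subset of word vectors: each picked vector gets a tiny
    value (eps) added or subtracted on one randomly chosen dimension.
    Labels are unchanged; word_vecs is an (n_words, dim) array.
    """
    rng = rng or np.random.default_rng()
    out = word_vecs.copy()
    picked = rng.choice(len(out), size=max(1, int(frac * len(out))), replace=False)
    for i in picked:
        dim = rng.integers(out.shape[1])
        out[i, dim] += eps * rng.choice([-1.0, 1.0])
    return out
```

The perturbed array is then fed to the similarity model in place of (or alongside) the clean one during training.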
After Model1 and Model2 are trained, for the new case, the matching degree with each historical case in the coarse-ranking result is computed in turn: for a historical case H, concatenate the new case's description with H's description and input the pair into Model1 to obtain matching score S1, and concatenate the new case's description with H's processing method and input the pair into Model2 to obtain matching score S2. The fine-ranking similarity S between H and the new case is:
S = X1 · S1 + X2 · S2
where X1 and X2 are the weights of matching scores S1 and S2, respectively: X1 is set larger when the similarity between case descriptions matters more for the fine-ranking similarity, and X2 is set larger when the match between the case description and the processing method matters more.
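The weighted combination and the selection of the best candidate can be sketched as below; equal weights are an illustrative default, and `fine_similarity` / `best_case` are hypothetical helper names:

```python
def fine_similarity(s1, s2, x1=0.5, x2=0.5):
    # S = X1*S1 + X2*S2; set x1 larger when description-description similarity
    # matters more, x2 larger when the description-method match matters more.
    return x1 * s1 + x2 * s2

def best_case(candidates, x1=0.5, x2=0.5):
    # candidates: list of (case_id, S1, S2) pairs scored by Model1 and Model2.
    # Returns (case_id, S) for the candidate with the largest fine similarity.
    scored = [(cid, fine_similarity(s1, s2, x1, x2)) for cid, s1, s2 in candidates]
    return max(scored, key=lambda t: t[1])
```

The processing method of the returned `case_id` would then be adopted as the new case's processing result.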
Compute the fine-ranking similarity between the new case and each historical case in the coarse-ranking result in turn, select the historical case with the largest fine-ranking similarity, and take that case's processing method as the processing result of the new case.
Preferably, selecting the historical case with the largest fine-ranking similarity and taking its processing method as the processing result of the new case specifically comprises: if the largest fine-ranking similarity is greater than a threshold, directly select that historical case's processing method as the result; otherwise, generate a processing strategy with a model.
For a new case without a usable reference, a processing strategy is generated with a model: when no historical case has a fine-ranking similarity greater than the threshold (i.e., no historical case can serve as a reference), a suitable processing method is generated by a model.
Further, generating a processing strategy with a model specifically comprises: construct a seq2seq model in which a BERT model is chosen as the encoder module and a BERT model is chosen as the decoder module. The input is a case's description and the output is the corresponding processing method; training the model on historical cases and their processing methods yields Model3, and inputting the new case's description into Model3 produces a generated processing method.
Further, the reasonableness of the processing method generated by Model3 is judged: the matching degree between the new case's description and the generated processing method is computed with the matching-degree model Model2; if the matching degree is greater than a set threshold, the generated method is considered suitable and can be used directly; otherwise it is adjusted manually.
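The reasonableness check reduces to a small guard; in this sketch the threshold value and the `model2_score_fn` callable (a stand-in for Model2 inference) are assumptions:

```python
def accept_generated(description, method, model2_score_fn, threshold=0.6):
    # Keep Model3's generated method only if Model2 scores the
    # (description, method) pair above the threshold; otherwise flag it
    # for manual adjustment.
    score = model2_score_fn(description, method)
    return ("use", score) if score > threshold else ("manual", score)
```

In practice `model2_score_fn` would concatenate the pair, run the fine-tuned BERT classifier, and return its 0-1 matching score.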
In a second aspect, embodiments of the present disclosure provide a big-data-based automatic case processing apparatus that can, based on the same technical concept, implement or execute the big-data-based automatic case processing method of any of the possible implementations above.
Preferably, the apparatus comprises an acquisition unit, a coarse-ranking unit, and a fine-ranking unit;
the acquiring unit is configured to execute the step 1 of the case automatic processing method based on big data according to any one of all possible implementation manners.
The coarse arrangement unit is configured to execute the step 2 of the automatic case processing method based on big data according to any one of all possible implementation manners.
The fine ranking unit is used for executing the step 3 of the case automatic processing method based on big data in any one of all possible implementation modes.
It should be noted that when the big-data-based automatic case processing method of the above embodiments is implemented, the division into the functional modules above is only an example; in practical applications, the functions may be allocated to different functional modules as needed, i.e., the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiment and the method embodiment provided above belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.
In a third aspect, an embodiment of the present disclosure provides a terminal device that includes the big-data-based automatic case processing apparatus of any of the possible implementations above.
The invention has been described above by way of example with reference to the accompanying drawings. It should be understood that the invention is not limited to the specific embodiments described above; numerous insubstantial modifications made in accordance with the principles and solutions of the invention, or direct applications of its conception and technical scheme to other occasions without improvement, fall within the protection scope of the invention.

Claims (10)

1. A big-data-based automatic case processing method, characterized by comprising the following steps:
step 1, acquiring all processed historical cases as the historical cases to be matched, wherein each historical case comprises its case description and processing result; extracting several keywords for each case from its case description and processing result, computing a word vector for each keyword with a Chinese BERT model, and averaging the keyword word vectors to obtain the central-idea vector of the case;
step 2, performing coarse-ranking matching of the new case against the processed historical cases;
for a new case, first selecting several keywords from its case description and adding their synonyms to form a search-term set W = {w_1, w_2, …, w_n}, where n is the number of search terms; computing the word vector of each search term with the Chinese BERT model; normalizing each search-term word vector and each central-idea vector, i.e., dividing each vector by its norm so that the normalized vector has length 1; and denoting by A_i the normalized word vector of the new case's search term w_i and by B the normalized central-idea vector of a given historical case;
calculating the coarse-ranking similarity between the new case and each historical case, wherein the coarse-ranking similarity is the average inner product of the new case's normalized search-term word vectors with the historical case's normalized central-idea vector, i.e., the coarse-ranking similarity C is:
C = (1/n) · Σ_{i=1}^{n} (A_i · B)
acquiring the historical cases whose coarse-ranking similarity exceeds a given threshold, and selecting the top N of them by coarse-ranking similarity as the coarse-ranking result;
step 3, after the coarse-ranking result is obtained, computing fine-ranking similarity with a text similarity matching algorithm and intelligently matching a processing result for the new case;
constructing a case description-case description matching-degree model and a case description-processing method matching-degree model and training both, wherein the two models share the same structure, a BERT encoder followed by a binary-classification head;
training a case description-case description matching degree model:
for any two historical cases, labeling the pair matched if their case descriptions describe the same fact and unmatched otherwise, thereby obtaining training samples;
the training process being as follows: taking the two historical cases as text 1 and text 2, converting each word of the two texts into a word vector, inputting them into a BERT model, feeding the vector output at the first position ([CLS]) of the last BERT layer into a linear binary classifier to obtain a matching score in the range 0-1, considering the pair matched when the matching score is ≥ α, with α ∈ [0.5, 0.6], and otherwise unmatched; and training the parameters on the training samples to obtain the case description-case description matching-degree model, Model1;
training the case description-processing method matching degree model:
for any historical case, its case description and its own processing method are regarded as matched, while the case description of one case paired with the processing method of a different case is regarded as unmatched, thereby obtaining training samples;
the training process is as follows: a case description and a processing method are taken as text 1 and text 2 respectively, each word of the two texts is converted into a word vector and input into the BERT model; the vector output at the first [CLS] position of the last BERT layer is input into a linear binary classifier to obtain a matching score in the range 0-1; when the matching score is greater than or equal to β, β ∈ [0.6, 0.7], the pair is regarded as matched, otherwise as unmatched; the model parameters are trained on the training samples to obtain the case description-processing method matching degree model Model2;
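The thresholded matching decision shared by the two models can be sketched as follows. `score_fn` stands in for the BERT-plus-linear-classifier pipeline described above, which maps a text pair to a score in [0, 1]; the function names and the Jaccard stand-in scorer are illustrative, not from the patent:

```python
def is_match(score_fn, text1, text2, threshold):
    """Matched when the model's score reaches the threshold
    (alpha in [0.5, 0.6] for Model1, beta in [0.6, 0.7] for Model2)."""
    return score_fn(text1, text2) >= threshold

def toy_score(text1, text2):
    """Stand-in scorer: Jaccard word overlap, just to make the sketch runnable."""
    w1, w2 = set(text1.split()), set(text2.split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0
```

In production, `score_fn` would run the concatenated pair through BERT and read the classifier's probability; only the threshold rule differs between Model1 and Model2.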
after Model1 and Model2 are obtained through training, for a new case, the matching degree between the new case and each historical case in the coarse-ranking result is calculated in turn: for a historical case H, the case description of the new case is concatenated with the case description of H and input into Model1 to obtain a matching score S1, and the case description of the new case is concatenated with the processing method of H and input into Model2 to obtain a matching score S2; the fine-ranking similarity S between H and the new case is then:
S = X1 × S1 + X2 × S2
wherein X1 and X2 are the weights of the matching score S1 and the matching score S2 respectively;
the fine-ranking similarity between the new case and each historical case in the coarse-ranking result is calculated in turn; the historical case with the largest fine-ranking similarity is selected, and the processing method of that historical case is taken as the processing result of the new case.
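The weighted combination S = X1·S1 + X2·S2 and the selection of the best historical case can be sketched as below; `model1` and `model2` are stand-in callables for the trained matching models, and the default weights of 0.5 are illustrative since the claim leaves X1 and X2 unspecified:

```python
def fine_similarity(s1, s2, x1=0.5, x2=0.5):
    """Fine-ranking similarity S = X1*S1 + X2*S2."""
    return x1 * s1 + x2 * s2

def pick_processing_method(new_case, coarse_result, model1, model2, x1=0.5, x2=0.5):
    """Score every historical case in the coarse-ranking result and return the
    processing method of the one with the largest fine-ranking similarity."""
    best_case, best_s = None, float("-inf")
    for hist in coarse_result:
        s1 = model1(new_case["description"], hist["description"])  # description vs description
        s2 = model2(new_case["description"], hist["method"])       # description vs method
        s = fine_similarity(s1, s2, x1, x2)
        if s > best_s:
            best_case, best_s = hist, s
    return best_case["method"], best_s
```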
2. The automatic case processing method based on big data as claimed in claim 1, further comprising: when a processed historical case is acquired in step 1 or a new case is acquired in step 2, if the case is submitted as text, the text is used directly as the case description; if the case is submitted as a PDF or an image, the case description is extracted through image recognition.
3. The method as claimed in claim 1, further comprising, during acquisition of the training samples for Model2: if the processing method of another case is also applicable to the present case, the case description of the present case and that processing method are likewise regarded as matched.
4. The automatic case processing method based on big data as claimed in claim 1, characterized in that augmented samples are added during training of the two matching degree models, namely: after each word in a text is converted into a word vector, a part of the word vectors is selected at random, one or more dimensions of each selected vector are perturbed by adding or subtracting a tiny value, and the perturbed vectors are then input into the model for training.
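The augmentation in claim 4 can be sketched as below. The fraction of vectors to perturb and the size of the perturbation are illustrative choices; the claim only says "a part" of the vectors and "a minimum value":

```python
import random

def augment_word_vectors(vectors, pick_ratio=0.2, epsilon=1e-3, rng=None):
    """Randomly select a fraction of the word vectors and perturb one or more
    of their dimensions by +/- a tiny value; returns new vectors, leaving
    the originals untouched."""
    rng = rng or random.Random()
    augmented = [list(v) for v in vectors]  # deep-enough copy of the list of vectors
    n_pick = max(1, int(len(augmented) * pick_ratio))
    for idx in rng.sample(range(len(augmented)), n_pick):
        dims = rng.sample(range(len(augmented[idx])),
                          rng.randint(1, len(augmented[idx])))
        for d in dims:
            augmented[idx][d] += rng.choice((-epsilon, epsilon))
    return augmented
```

Because ε is tiny relative to typical embedding magnitudes, the augmented sample keeps the original label while slightly widening the training distribution.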
5. The automatic case processing method based on big data according to claim 1, characterized in that selecting the historical case with the largest fine-ranking similarity and taking its processing method as the processing result of the new case is replaced with: selecting several historical cases with the highest fine-ranking similarity and synthesizing their processing methods to obtain the processing result of the new case.
6. The automatic case processing method based on big data according to any one of claims 1-5, characterized in that selecting the historical case with the largest fine-ranking similarity and taking its processing method as the processing result of the new case specifically comprises: if the largest fine-ranking similarity is greater than a threshold, the processing method of that historical case is selected directly as the result; otherwise, a processing strategy is generated by a model.
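The retrieve-or-generate dispatch in claim 6 reduces to a single comparison. The threshold value, function names, and the generator callable below are illustrative stand-ins (the generator corresponds to Model3 of claim 7):

```python
def choose_or_generate(best_method, best_similarity, generate_fn, threshold=0.8):
    """Reuse the best historical processing method when the largest
    fine-ranking similarity clears the threshold; otherwise fall back
    to a generative model. Returns the method and its provenance."""
    if best_similarity > threshold:
        return best_method, "retrieved"
    return generate_fn(), "generated"
```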
7. The automatic case processing method based on big data as claimed in claim 6, characterized in that generating the processing strategy by a model specifically comprises: constructing a seq2seq model in which a BERT model serves as the encoder and another BERT model serves as the decoder; the input is the case description of a case and the output is the corresponding processing method; the model is trained on historical cases and their corresponding processing methods to obtain Model3, and the case description of the new case is input into Model3 to obtain a generated processing method.
8. The method according to claim 7, further comprising judging the reasonableness of the processing method generated by Model3: the matching degree between the case description of the new case and the processing method generated by Model3 is calculated with the matching degree model Model2; if the matching degree is greater than a set threshold, the generated processing method is considered suitable and can be used directly; otherwise, it is adjusted manually.
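The reasonableness check in claim 8 can be sketched as below; `model2_score_fn` stands in for the trained Model2, and the names and threshold value are illustrative:

```python
def vet_generated_method(new_description, generated_method, model2_score_fn,
                         threshold=0.7):
    """Score the (case description, generated method) pair with Model2;
    accept the generated method above the threshold, otherwise flag it
    for manual adjustment."""
    score = model2_score_fn(new_description, generated_method)
    return ("accept", score) if score > threshold else ("manual_review", score)
```

This reuses the retrieval-time matching model as a cheap quality gate on the generator's output, so no extra model needs to be trained for the check.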
9. An automatic case processing device based on big data, characterized in that the device implements the automatic case processing method based on big data as claimed in any one of claims 1-8.
10. A terminal device, characterized in that the terminal device comprises the automatic case processing device based on big data according to claim 9.
CN202110542723.4A 2021-05-19 2021-05-19 Case automatic processing method and device based on big data and terminal equipment Active CN113032544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110542723.4A CN113032544B (en) 2021-05-19 2021-05-19 Case automatic processing method and device based on big data and terminal equipment


Publications (2)

Publication Number Publication Date
CN113032544A CN113032544A (en) 2021-06-25
CN113032544B true CN113032544B (en) 2021-08-20

Family

ID=76455561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110542723.4A Active CN113032544B (en) 2021-05-19 2021-05-19 Case automatic processing method and device based on big data and terminal equipment

Country Status (1)

Country Link
CN (1) CN113032544B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188013B (en) * 2022-09-14 2023-06-30 泰豪信息技术有限公司 Risk prevention and control method, system, storage medium and equipment for decision book
CN115630834B (en) * 2022-12-21 2023-03-28 北京时代凌宇数字技术有限公司 Case dispatching method and device, electronic equipment and computer-readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825879A (en) * 2019-09-18 2020-02-21 平安科技(深圳)有限公司 Case decision result determination method, device and equipment and computer readable storage medium
CN111144068A (en) * 2019-11-26 2020-05-12 方正璞华软件(武汉)股份有限公司 Similar arbitration case recommendation method and device
CN111259951A (en) * 2020-01-13 2020-06-09 北京明略软件***有限公司 Case detection method and device, electronic equipment and readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Legal Feature Enhanced Semantic Matching Network for Similar Case Matching; Zhilong Hong et al.; 2020 International Joint Conference on Neural Networks (IJCNN); 2020-09-28; pp. 1-8 *
A semi-supervised learning method for identifying documents of cases involving minors; Yang Shenghao et al.; Journal of South China University of Technology (Natural Science Edition); 2021-01-31; Vol. 49, No. 1; pp. 29-46 *


Similar Documents

Publication Publication Date Title
CN108595696A (en) A kind of human-computer interaction intelligent answering method and system based on cloud platform
CN110765246B (en) Question and answer method and device based on intelligent robot, storage medium and intelligent device
CN110781276A (en) Text extraction method, device, equipment and storage medium
CN112464662B (en) Medical phrase matching method, device, equipment and storage medium
CN113032544B (en) Case automatic processing method and device based on big data and terminal equipment
KR20200007969A (en) Information processing methods, terminals, and computer storage media
CN110188195B (en) Text intention recognition method, device and equipment based on deep learning
CN107239564B (en) Text label recommendation method based on supervision topic model
CN110909224B (en) Sensitive data automatic classification and identification method and system based on artificial intelligence
CN112418320B (en) Enterprise association relation identification method, device and storage medium
CN111368096A (en) Knowledge graph-based information analysis method, device, equipment and storage medium
CN107291775A (en) The reparation language material generation method and device of error sample
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN106250366B (en) A kind of data processing method and system for question answering system
CN113065352B (en) Method for identifying operation content of power grid dispatching work text
CN110825852B (en) Long text-oriented semantic matching method and system
CN116070642A (en) Text emotion analysis method and related device based on expression embedding
CN110413750A (en) The method and apparatus for recalling standard question sentence according to user's question sentence
CN114385876A (en) Model search space generation method, device and system
CN114328903A (en) Text clustering-based customer service log backflow method and device
CN115618092A (en) Information recommendation method and information recommendation system
CN111708862A (en) Text matching method and device and electronic equipment
CN117033464B (en) Log parallel analysis algorithm based on clustering and application
CN114942980B (en) Method and device for determining text matching
CN116910377B (en) Grid event classified search recommendation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An automatic case processing method, device and terminal equipment based on big data

Effective date of registration: 20220705

Granted publication date: 20210820

Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch

Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.

Registration number: Y2022980009897

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20230720

Granted publication date: 20210820

Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch

Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.

Registration number: Y2022980009897

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A method, device, and terminal device for automatic case processing based on big data

Effective date of registration: 20230803

Granted publication date: 20210820

Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch

Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.

Registration number: Y2023980050832

PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20210820

Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch

Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.

Registration number: Y2023980050832