CN110659997A - Data cluster identification method and device, computer system and readable storage medium - Google Patents


Info

Publication number: CN110659997A
Authority: CN (China)
Prior art keywords: case, cases, risk, clustering, library
Legal status: Granted; currently Active
Application number: CN201910754337.4A
Other languages: Chinese (zh)
Other versions: CN110659997B (en)
Inventors: 张密 (Zhang Mi), 唐文 (Tang Wen)
Current Assignee: Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee: Ping An Property and Casualty Insurance Company of China Ltd
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority application: CN201910754337.4A
Publication of application: CN110659997A
Application granted; publication of grant: CN110659997B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 - Insurance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data cluster identification method, a data cluster identification device, a computer system and a readable storage medium, which are based on artificial intelligence and comprise the following steps: setting one case in the case library as a reference case, and setting the other cases in the case library as comparison cases; judging in turn whether the reference case has an association relation with each comparison case, and producing a one-dimensional vector; obtaining in turn the one-dimensional vectors of all cases in the case library, and combining them to obtain an adjacency matrix; computing the adjacency matrix to obtain dense vectors; computing the dense vectors to cluster all cases in the case library and output a clustering result; and calculating the risk value of each case according to the policy information, automobile characteristics and reporter characteristics of the cases in the clustering result. The invention obtains high-risk cases by analyzing the cases in clusters whose size falls below a clustering threshold, so that a practitioner can focus analysis on the high-risk cases and identify suspected fraud cases among them for which recovery has not been pursued or a report has not been filed.

Description

Data cluster identification method and device, computer system and readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular to a data cluster identification method and device, a computer system, and a readable storage medium.
Background
Current insurance anti-fraud schemes mainly discriminate cases through black-and-white-list rules, but the existing black-and-white-list rule engines rely solely on human experience. This not only easily leads to misjudgment of cases, but also places a heavy workload on staff and increases enterprise personnel costs.
If, instead, cases are screened by a mature neural network built with supervised learning from the black-and-white-list rules, a large amount of labeled data, i.e. cases with suspected fraud properties, is needed for the network to learn from. Because large amounts of labeled data are difficult to obtain, this approach is often hard to implement in practice, and it is therefore difficult to build an effective and reliable mature neural network.
Disclosure of Invention
The invention aims to provide a data cluster identification method, a data cluster identification device, a computer system and a readable storage medium, which are used for solving the problems in the prior art.
In order to achieve the above object, the present invention provides a data cluster identification method, comprising the steps of:
S1: a data cluster recognition device sets one case in the case library as a reference case, and sets the other cases in the case library as comparison cases; it extracts the reference case and the comparison cases from the case library and judges in turn whether the reference case has an association relation with each comparison case; if yes, the relation value between the reference case and that comparison case is assigned 1, and if not, it is assigned 0; a one-dimensional vector of the reference case is produced from the relation values between the reference case and each comparison case; the data information of a case comprises a scene picture, report text information, scene structure information, policy information, automobile characteristics and reporter characteristics;
s2: sequentially setting the cases in the case library as reference cases according to the method of S1, sequentially obtaining one-dimensional vectors of the reference cases, and combining the one-dimensional vectors of all the cases in the case library to obtain an adjacency matrix;
s3: calculating the adjacency matrix by using the SDNE algorithm to obtain dense vectors expressing the association relations among all cases in the case library;
s4: calculating the dense vector by using an AP clustering algorithm to cluster all cases in the case library and output a clustering result;
s5: calculating the risk value of a case according to the policy information, automobile characteristics and reporter characteristics of the cases in the clustering result, and judging from the risk value whether the cases in a cluster of the clustering result are high-risk or low-risk cases; generating a high-risk signal from a high-risk case and outputting it to the client, or generating a low-risk signal from a low-risk case and outputting it to the client;
wherein the high risk signal and the low risk signal are to be output to the client in the form of communication signals, respectively.
In the foregoing solution, the step S1 includes the following steps:
s11: extracting one case in the case library as a reference case, and setting the cases other than the reference case in the case library as comparison cases; judging the association relation between the reference case and a comparison case;
s12: judging in turn, according to the method of S11, the association relation between the reference case and every comparison case in the case library;
s13: if the reference case and a comparison case have an association relation, assigning the relation value between them as 1; if they do not have an association relation, assigning the relation value between them as 0;
s14: combining the relation values between the reference case and each comparison case into a one-dimensional vector; the element value of each column of the one-dimensional vector is the relation value between the reference case and one comparison case.
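The relation-vector construction of steps S11-S14 can be sketched in a few lines of Python. Here `is_associated` is a hypothetical stand-in for the combined picture, text and scene-information checks described later, and setting the self-entry to 0 is an assumption the patent leaves open:

```python
def relation_vector(ref_case, cases, is_associated):
    """One row of the adjacency matrix for a single reference case.

    is_associated(a, b) is a hypothetical pairwise predicate standing in
    for the picture/text/scene-information checks; the self-entry is set
    to 0 here (an assumption - the patent does not fix the diagonal).
    """
    return [0 if c is ref_case else int(is_associated(ref_case, c))
            for c in cases]
```

Each call yields one row; stacking the rows for every case in library order gives the adjacency matrix of step S2.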
In the above solution, in step S2, all cases in the case library are used in turn as the reference case; the association relations between each reference case and its comparison cases are obtained in turn according to the method of S11, giving the one-dimensional vector of each reference case; and the one-dimensional vectors of the reference cases are combined to obtain the adjacency matrix.
In the foregoing solution, the step S3 includes the following steps:
s31: setting the embedding-layer dimension of the SDNE algorithm, and feeding the adjacency matrix into the input layer of the SDNE algorithm;
s32: controlling the SDNE algorithm to learn the first-order neighbor relations in the adjacency matrix in a supervised manner, and to learn the second-order neighbor relations in the adjacency matrix in an unsupervised deep-learning manner; optimizing the SDNE algorithm by combining the loss functions of the two learning processes, and finally extracting the embedding layer of the SDNE algorithm as the dense vectors of all cases in the case library.
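As a rough illustration of how the two loss terms of S32 combine, the NumPy sketch below computes an SDNE-style joint objective. The reconstruction `S_hat` and embedding matrix `Y` would come from the autoencoder itself (not shown), and the hyperparameter names `alpha` and `beta` follow the SDNE paper rather than anything fixed by the patent:

```python
import numpy as np

def sdne_joint_loss(S, S_hat, Y, alpha=1.0, beta=5.0):
    """Joint SDNE objective over adjacency matrix S.

    S_hat: the autoencoder's reconstruction of S (second-order, unsupervised);
    Y: the embedding layer's output, one dense vector per case (first-order,
    supervised by observed links). Nonzero entries of S are up-weighted by
    beta so observed links dominate the reconstruction error.
    """
    B = np.where(S != 0, beta, 1.0)                    # reconstruction penalty mask
    loss_2nd = np.sum(((S_hat - S) * B) ** 2)          # second-order proximity
    diff = Y[:, None, :] - Y[None, :, :]               # pairwise embedding gaps
    loss_1st = np.sum(S * np.sum(diff ** 2, axis=-1))  # linked cases stay close
    return float(loss_2nd + alpha * loss_1st)
```

Minimizing this objective drives linked cases toward nearby embeddings (first-order term) while forcing each embedding to reconstruct its row of the adjacency matrix (second-order term).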
In the foregoing solution, the step S4 includes the following steps:
s41: inputting the dense vectors into the AP clustering algorithm to obtain the Euclidean distance between any two cases in the case library, and thereby a similarity matrix;
s42: clustering the cases represented by the dense vectors according to the similarity matrix; obtaining in turn, through the attraction (responsibility) information matrix of the AP clustering algorithm, the degree to which each case is suited to serve as the cluster center of the other cases, and obtaining in turn, through the attribution (availability) information matrix of the AP clustering algorithm, the degree to which each case should select another case as its cluster center; the clustering ends and the clustering result is output when the iteration count is reached or the samples in every cluster region of the dense vectors remain unchanged.
In the foregoing solution, the step S42 includes the following steps:
s42-1: setting the iteration count of the AP clustering algorithm to T, and initializing r(i, j) and a(i, j) to 0;
where r(i, j) is the attraction (responsibility) information in the attraction information matrix, describing how well case j among the dense vectors is suited to serve as the cluster center of case i;
a(i, j) is the attribution (availability) information in the attribution information matrix, describing how appropriate it is for case i among the dense vectors to select case j as its cluster center;
s42-2: iterating the attraction information matrix r(i, j) of the AP clustering algorithm according to the following formula:
r(i, j) = s(i, j) - max_{j' ≠ j} { a(i, j') + s(i, j') };
where s(i, j) is the similarity between case i and case j, derived from the Euclidean distance between their dense vectors;
this formula measures the suitability of j as the center of i: the similarity between i and j, minus the largest value, over all other candidates j', of the similarity s(i, j') plus the degree a(i, j') to which case i would select case j' as its cluster center;
s42-3: iterating the attribution information a(i, j) of the AP clustering algorithm according to the following formula and conditions:
when i ≠ j: a(i, j) = min(0, r(j, j) + Σ_{i' ∉ {i, j}} max{0, r(i', j)});
when i = j: a(j, j) = Σ_{i' ≠ j} max{0, r(i', j)};
where, if i = j, the formula accumulates the evidence that case j is itself a cluster center, and otherwise it measures how appropriate it is for case i to take case j as its cluster center;
s42-4: iterating the AP clustering algorithm according to steps S42-2 and S42-3 until the iteration count reaches T or the samples in every cluster region remain unchanged, then ending the AP clustering algorithm and outputting the clustering result.
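The update rules of S42-1 to S42-4 can be implemented compactly with NumPy. This is a minimal sketch: the similarity is taken as the negative squared Euclidean distance and the diagonal "preference" is set to the median similarity, both common conventions that the patent does not prescribe; damping is added, as in standard AP implementations, to keep the iteration stable:

```python
import numpy as np

def ap_cluster(X, T=200, damping=0.5):
    """Affinity-propagation clustering of dense vectors X, following S42.

    Returns the exemplar (cluster-center) index chosen for each sample.
    """
    n = len(X)
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # similarity s(i, j)
    S[np.arange(n), np.arange(n)] = np.median(S)         # preference (assumed)
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(T):
        # r(i,j) = s(i,j) - max_{j' != j} { a(i,j') + s(i,j') }
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,j) = min(0, r(j,j) + sum_{i' not in {i,j}} max(0, r(i',j)))
        Rp = np.maximum(R, 0)
        Rp[np.arange(n), np.arange(n)] = R[np.arange(n), np.arange(n)]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[np.arange(n), np.arange(n)].copy()  # a(j,j): no min(0, .)
        A_new = np.minimum(A_new, 0)
        A_new[np.arange(n), np.arange(n)] = diag
        A = damping * A + (1 - damping) * A_new
    return (A + R).argmax(axis=1)
```

A production version would also implement the early exit of S42-4 (stop once cluster assignments stay unchanged across iterations) rather than always running T rounds.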
In the foregoing solution, the step S5 includes the following steps:
s51: obtaining the clusters in the clustering result whose case count is below the clustering threshold; extracting the policy information, automobile characteristics and reporter characteristics of the cases in such a cluster;
s52: acquiring the number A of invalid policies in the cluster according to the policy information;
s53: acquiring the number B of second-hand vehicles and the number C of staged-accident ("pengci") vehicles among the cases in the cluster according to the automobile characteristics;
s54: acquiring the number D of blacklisted reporters in the cluster according to the reporter characteristics;
s55: computing the number A of invalid policies, the number B of second-hand vehicles, the number C of staged-accident vehicles and the number D of blacklisted reporters with a weighted-sum formula to obtain the risk value Y;
s56: if the risk value Y does not exceed the risk threshold, determining that the cases in the cluster are low-risk cases;
if the risk value Y exceeds the risk threshold, determining that the cases in the cluster are high-risk cases;
s57: generating a high-risk signal from a high-risk case and outputting it to the client in the form of a communication signal; or
generating a low-risk signal from a low-risk case and outputting it to the client in the form of a communication signal.
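Step S55's weighted sum reduces to a one-liner; the weights and the risk threshold below are illustrative placeholders, since the patent fixes neither:

```python
def risk_level(A, B, C, D, weights=(0.3, 0.2, 0.3, 0.2), threshold=1.0):
    """Weighted-sum risk value over the counts from S52-S54.

    A: invalid policies, B: second-hand vehicles, C: staged-accident
    ("pengci") vehicles, D: blacklisted reporters. Weights and threshold
    are hypothetical; in practice they would be tuned from business
    experience, as S56 suggests for the risk threshold.
    """
    w_a, w_b, w_c, w_d = weights
    Y = w_a * A + w_b * B + w_c * C + w_d * D
    return ("high risk", Y) if Y > threshold else ("low risk", Y)
```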
In order to achieve the above object, the present invention further provides a data cluster identification apparatus, including:
a data cluster identification apparatus, comprising:
the one-dimensional vector formulation module is used for setting one case in the case library as a reference case and setting the other cases in the case library as comparison cases; extracting the reference case and the comparison cases from the case library, and judging in turn whether the reference case has an association relation with each comparison case; if yes, assigning the relation value between the reference case and that comparison case as 1, and if not, assigning it as 0; producing the one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; the data information of a case comprises a scene picture, report text information, scene structure information, policy information, automobile characteristics and reporter characteristics;
the adjacency matrix formulation module is used for calling the one-dimensional vector formulation module to sequentially set the cases in the case library as reference cases, sequentially obtain one-dimensional vectors of all the reference cases, and merge the one-dimensional vectors of all the cases in the case library to obtain an adjacency matrix;
the vector operation module is used for calculating the adjacency matrix by using an SDNE algorithm to obtain dense vectors of all cases in the case library;
the clustering operation module is used for calculating the dense vector by utilizing an AP clustering algorithm so as to cluster all cases in the case library and output a clustering result;
the risk evaluation module is used for calculating the risk value of a case according to the policy information, automobile characteristics and reporter characteristics of the cases in the clustering result, and judging from the risk value whether the cases in a cluster of the clustering result are high-risk or low-risk cases; generating a high-risk signal from a high-risk case and outputting it to the client, or generating a low-risk signal from a low-risk case and outputting it to the client; wherein the high-risk signal and the low-risk signal are each output to the client in the form of communication signals.
The invention also provides a computer system which comprises a plurality of computer devices, wherein each computer device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and the processors of the plurality of computer devices jointly realize the steps of the data cluster identification method when executing the computer program.
In order to achieve the above object, the present invention further provides a computer-readable storage medium, which includes a plurality of storage media, each storage medium having a computer program stored thereon, wherein the computer programs stored in the storage media, when executed by a processor, collectively implement the steps of the data cluster identification method.
The invention provides a data cluster identification method, a data cluster identification device, a computer system and a readable storage medium. First, the one-dimensional vector formulation module and the adjacency matrix formulation module are used to obtain an adjacency matrix expressing the association relations among the cases in the case library; the vector operation module is used to reduce the dimensionality of the adjacency matrix and obtain dense vectors expressing the association relations among all cases in the case library; the clustering operation module computes the dense vectors with the AP clustering algorithm to obtain a clustering result grouping all cases in the case library; and the risk evaluation module is used to obtain the high-risk cases and low-risk cases.
In the insurance industry, a suspected fraudulent report is generally a small-probability event, yet the probability that it was planned by a gang is very high. The clustering threshold is therefore adjusted according to business experience and regional conditions, and high-risk cases are obtained by analyzing the cases in clusters below the clustering threshold. A practitioner can then focus analysis on the high-risk cases to identify suspected fraud cases for which recovery has not been pursued or a report has not been filed, and pursue recovery or alert the authorities so as to recoup the losses the insurance company suffered through fraud.
At the same time, the accuracy of case fraud-risk evaluation is greatly improved, as are the speed and efficiency of case evaluation; compared with handing cases directly to manual processing, the method greatly reduces labor cost.
Moreover, because the technical scheme clusters the cases in the case library and analyzes the cases in the clusters below the clustering threshold, a large amount of labeled data, i.e. a large number of cases with suspected fraud properties, need not be provided for neural-network learning; the scheme can therefore be implemented easily in practice, and case risk assessment is realized quickly and effectively.
Drawings
FIG. 1 is a flowchart of a first embodiment of a data cluster identification method according to the present invention;
FIG. 2 is a flowchart illustrating a process between a data cluster recognition device and a service system according to a first embodiment of the data cluster recognition method of the present invention;
FIG. 3 is a schematic diagram of program modules of a second embodiment of the data cluster identification apparatus according to the present invention;
fig. 4 is a schematic diagram of a hardware structure of a computer device in the third embodiment of the computer system according to the present invention.
Reference numerals:
1. data cluster recognition device; 2. case library; 3. computer device; 4. client
11. one-dimensional vector formulation module; 12. adjacency matrix formulation module; 13. vector operation module
14. clustering operation module; 15. risk assessment module; 31. memory; 32. processor
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a data cluster identification method based on a one-dimensional vector formulation module, an adjacency matrix formulation module, a vector operation module, a clustering operation module and a risk assessment module, suitable for the field of artificial intelligence. First, the one-dimensional vector formulation module and the adjacency matrix formulation module are used to obtain an adjacency matrix expressing the association relations among the cases in the case library; the vector operation module reduces the dimensionality of the adjacency matrix to obtain dense vectors expressing the association relations among all cases in the case library; the clustering operation module computes the dense vectors with the AP clustering algorithm to obtain a clustering result grouping all cases in the case library; and the risk evaluation module obtains the high-risk and low-risk cases.
The high-risk cases are obtained by analyzing the cases in the clusters below the clustering threshold, so that a practitioner can focus analysis on the high-risk cases to identify suspected fraud cases for which recovery has not been pursued or a report has not been filed, and pursue recovery or alert the authorities to recoup the insurance company's losses from fraud.
Example one
Referring to fig. 1 and fig. 2, the data cluster identification method of the present embodiment, which uses the data cluster identification device 1, includes the following steps:
s1: setting one case in the case base 2 as a reference case and setting the other cases in the case base 2 as comparison cases; extracting the reference case and the comparison cases from the case library 2, and judging in turn whether the reference case has an association relation with each comparison case; if yes, assigning the relation value between the reference case and that comparison case as 1, and if not, assigning it as 0; producing a one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; the data information of a case comprises a scene picture, report text information, scene structure information, policy information, automobile characteristics and reporter characteristics;
s2: sequentially setting the cases in the case library 2 as reference cases according to the method of S1, sequentially obtaining one-dimensional vectors of the reference cases, and combining the one-dimensional vectors of all the cases in the case library 2 to obtain an adjacency matrix;
s3: calculating the adjacency matrix by using the SDNE algorithm to obtain dense vectors expressing the association relations among all cases in the case library 2;
s4: calculating the dense vector by using an AP clustering algorithm to cluster all cases in the case library 2 and outputting a clustering result;
s5: calculating the risk value of a case according to the policy information, automobile characteristics and reporter characteristics of the cases in the clustering result, and judging from the risk value whether the cases in a cluster of the clustering result are high-risk or low-risk cases; generating a high-risk signal from a high-risk case and outputting it to the client 4, or generating a low-risk signal from a low-risk case and outputting it to the client 4;
wherein the high-risk signal and the low-risk signal are to be output to the client 4 in the form of communication signals, respectively.
Specifically, the step S1 includes the following steps:
s11: extracting one case in the case library as a reference case, and setting the cases other than the reference case in the case library as comparison cases; judging the association relation between the reference case and a comparison case;
s12: judging in turn, according to the method of S11, the association relation between the reference case and every comparison case in the case library;
further, the basic information in step S11 includes the scene picture, the report text and the scene structure information;
the historical report data comprises the scene picture, the report text information and the scene structure information.
S13: if the reference case and a comparison case have an association relation, assigning the relation value between them as 1; if they do not have an association relation, assigning the relation value between them as 0;
s14: combining the relation values between the reference case and each comparison case into a one-dimensional vector; the element value of each column of the one-dimensional vector is the relation value between the reference case and one comparison case.
Further, the step S11 includes the following steps:
s11-01: extracting the SIFT features of all scene pictures in the case library 2;
preferably, the SIFT feature extracted from a scene picture is a 128-dimensional SIFT descriptor;
s11-02: pooling the SIFT features of all scene pictures to obtain a local-descriptor set; clustering the local-descriptor set with a clustering algorithm to merge similar SIFT features and build a visual dictionary;
in this step, the clustering algorithm is the K-Means algorithm;
further, the value of K in the K-Means algorithm is 80, so that the clustering algorithm divides the SIFT features in the local-descriptor set into 80 clusters;
s11-03: selecting one case from the case library 2 as the reference case, and taking the cases other than the reference case in the case library 2 as comparison cases; taking the scene picture of the reference case as the fixed picture, and the scene picture of a comparison case as the comparison picture; selecting the comparison picture of a comparison case, and selecting from the fixed picture and the comparison picture, respectively, the several SIFT features that score highest against the visual dictionary;
preferably, the 80 highest-scoring, non-repeated SIFT features are obtained from the fixed picture and from the comparison picture, respectively, according to the visual dictionary;
s11-04: forming the SIFT features of the fixed picture and of the comparison picture into a fixed vector and a comparison vector, respectively;
s11-05: calculating the cosine similarity between the fixed-picture vector and the comparison-picture vector to obtain a picture similarity value;
s11-06: if the similarity value exceeds the picture similarity threshold, judging that the fixed picture is associated with the comparison picture, and hence that the reference case has an association relation with the comparison case;
if the similarity value does not exceed the picture similarity threshold, judging that the fixed picture is not associated with the comparison picture, and hence that the reference case has no association relation with the comparison case.
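Steps S11-01 to S11-05 amount to a bag-of-visual-words comparison. The sketch below assumes the SIFT descriptors have already been extracted (e.g. with OpenCV, which is outside this sketch) and uses scikit-learn's KMeans for the visual dictionary; with real data K would be 80, as the text suggests:

```python
import numpy as np
from sklearn.cluster import KMeans

def picture_similarity(desc_a, desc_b, library_desc, k=80):
    """Cosine similarity of two pictures' visual-word histograms.

    desc_a / desc_b: (m, 128) SIFT descriptor arrays for the fixed and
    comparison pictures; library_desc: the pooled local-descriptor set of
    the whole case library (S11-02), clustered into a k-word dictionary.
    """
    vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(library_desc)

    def histogram(desc):
        # Count how often each visual word occurs in the picture.
        return np.bincount(vocab.predict(desc), minlength=k).astype(float)

    ha, hb = histogram(desc_a), histogram(desc_b)
    return float(ha @ hb / (np.linalg.norm(ha) * np.linalg.norm(hb)))
```

Comparing the returned value against the picture similarity threshold then gives the association judgment of S11-06.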
The step S12 includes the steps of:
s12-01: judging, according to steps S11-03 to S11-06, whether each comparison case in the case library 2 has an association relation with the reference case;
preferably, the step S11 further includes the steps of:
s11-11: extracting all report texts in the case library 2, training word2vec with the report texts, and obtaining word vectors;
s11-12: selecting the reference case of S11-03 from the case library 2 and extracting its report text as the reference text; selecting a comparison case of S11-03 and extracting its report text as the comparison text; calculating the word-frequency vectors of the reference text and the comparison text from the word vectors, respectively, to obtain a reference word-frequency vector and a comparison word-frequency vector;
s11-13: calculating the cosine similarity between the reference word-frequency vector and the comparison word-frequency vector to obtain a text similarity value;
s11-14: if the text similarity value is greater than the text similarity threshold, judging that text association exists between the reference text and the comparison text, and hence that the reference case and the comparison case have an association relation;
if the text similarity value is smaller than the text similarity threshold, judging that the reference text and the comparison text have no text association, and hence that the reference case and the comparison case have no association relation;
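The word-frequency comparison of S11-12 and S11-13 can be sketched as follows. For simplicity this counts raw word frequencies over a given vocabulary rather than weighting by word2vec vectors, and it assumes the report text is already tokenised (Chinese text would first need a segmenter); both simplifications are assumptions, not part of the patent:

```python
import numpy as np

def text_similarity(ref_tokens, cmp_tokens, vocabulary):
    """Cosine similarity between two word-frequency vectors."""
    def freq(tokens):
        return np.array([tokens.count(w) for w in vocabulary], dtype=float)

    a, b = freq(ref_tokens), freq(cmp_tokens)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```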
the step S12 further includes the steps of:
s12-11: judging, according to steps S11-12 to S11-14, whether each comparison case in the case library 2 has an association relation with the reference case;
preferably, the step S11 further includes the steps of:
s11-21: acquiring the reference case in the S11-03 from the case library 2, and extracting the field structure information in the reference case as reference field information; selecting one of the comparison cases in S11-03, and extracting the site structure information in the comparison case as comparison site information;
s11-22: comparing the factor fields of the reference field information and the comparison field information;
in the step, the factor field of the field structure information comprises the information of a counter, a number plate and a frame number of a involved vehicle; an example of field structure information is now provided as follows:
report people information Zhang III, 13900000000
Shanghai X, 00000 brand of pertinent vehicle
Case-involved vehicle frame number WAUR1111111111111
Comparing the contents corresponding to the notice person information, the involved license plate number and the involved vehicle frame number in the reference site information and the reference site information one by one;
optionally, the reporter information may be a reporter phone number, a reporter name, or a combination of the two.
S11-23: if the reference field information and the comparison field information have any corresponding factor field with identical content, judging that the reference case and the comparison case have an association relationship;
and if the reference field information and the comparison field information have no corresponding factor field with identical content, judging that the reference case and the comparison case have no association relationship.
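Steps S11-21 to S11-23 amount to an exact match on any one factor field; a minimal sketch, assuming the field structure information has been parsed into dictionaries whose keys `reporter`, `plate_number` and `vin` are hypothetical names:

```python
def fields_associated(ref_fields, cmp_fields,
                      factor_keys=("reporter", "plate_number", "vin")):
    """Judge association: any factor field with identical, non-empty
    content in both cases establishes an association relationship."""
    return any(ref_fields.get(k) and ref_fields.get(k) == cmp_fields.get(k)
               for k in factor_keys)

# Example mirroring the field structure information above
ref = {"reporter": "Zhang San 13900000000",
       "plate_number": "Shanghai X 00000",
       "vin": "WAUR1111111111111"}
cmp_case = {"reporter": "Li Si 13800000000",
            "plate_number": "Shanghai X 00000",  # same involved vehicle
            "vin": "WAUR2222222222222"}
```

Here the shared license plate alone establishes the association, matching the "any corresponding factor field with identical content" criterion.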
The step S12 further includes the steps of:
s12-21: according to steps S11-21 to S11-23, determining whether each comparison case in the case library 2 has an association relationship with the reference case.
Specifically, in step S2, all cases in the case library 2 are used as reference cases in turn; according to step S11, the association relationship between each comparison case and the current reference case is obtained, yielding the one-dimensional vector of each reference case; the one-dimensional vectors of the reference cases are then combined to obtain an adjacency matrix.
In this step, the cases in the case library 2 carry case labels; the cases are set as reference cases in label order, and the one-dimensional vector of each case is obtained in turn through step S1.
Optionally, a label stack is provided: all case labels of the case library 2 are stored in the label stack; a case label is drawn from the label stack at random, the case corresponding to that label is fetched from the case library 2 as the reference case, the remaining cases are set as comparison cases, the one-dimensional vector of the reference case is obtained by the method of S1, and the label of the reference case is removed from the label stack; the above operations are repeated until the label stack is empty, yielding the one-dimensional vectors of all cases.
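The optional label-stack procedure can be sketched as follows; `associated` stands in for the association test of step S1 and is an assumed callable:

```python
import random

def build_one_dim_vectors(case_labels, associated):
    """Draw case labels from a shuffled label stack until it is empty,
    producing each reference case's one-dimensional vector of relation
    values against every case (0 on the diagonal)."""
    stack = list(case_labels)
    random.shuffle(stack)          # random extraction from the label stack
    vectors = {}
    while stack:
        ref = stack.pop()          # remove the drawn label from the stack
        vectors[ref] = [0 if c == ref else int(bool(associated(ref, c)))
                        for c in case_labels]
    return vectors
```

Because every label is drawn exactly once, the resulting set of vectors is independent of the random draw order.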
In mathematics, a graph in graph theory consists of a set of given points and the connecting lines between them, and is generally used to describe an internal or specific relationship among things. In this embodiment, cases are taken as the nodes of a graph, the connecting lines between nodes are the association relationships between pairs of cases, and the association relationships among the cases in the case library 2 are expressed using the adjacency-matrix concept of graph theory;
in graph theory and computer science, an adjacency matrix expresses the structure of a graph. Its elements take only the two values 1 and 0, indicating whether the corresponding nodes are associated; the main diagonal is all 0, and for a simple undirected graph the matrix is symmetric. If two cases have an association relationship the corresponding element is assigned 1; otherwise it is assigned 0;
accordingly, the row of a one-dimensional vector expresses a reference case, its columns express the comparison cases, and its element values express the association relationship between the reference case and each comparison case; by stacking the one-dimensional vectors of the cases row by row, an adjacency matrix is obtained;
the rows of the adjacency matrix express the reference cases, the columns express the comparison cases, and each element value expresses the association relationship between the reference case of its row and the comparison case of its column.
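Stacking the one-dimensional vectors row by row (step S2) yields the adjacency matrix; a sketch with NumPy, checking the symmetry and zero main diagonal described above:

```python
import numpy as np

def adjacency_from_vectors(vectors, case_labels):
    """Rows index reference cases, columns index comparison cases; for an
    undirected association relationship the matrix is symmetric with a
    zero main diagonal."""
    A = np.array([vectors[c] for c in case_labels], dtype=int)
    assert np.array_equal(A, A.T), "association is undirected"
    assert not A.diagonal().any(), "main diagonal must be all zeros"
    return A
```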
Specifically, the step S3 includes the following steps:
s31: the SDNE (Structural Deep Network Embedding) algorithm has an embedding-layer dimension; the adjacency matrix is fed into the input layer of the SDNE algorithm;
wherein, the dimension of the embedding layer of the SDNE algorithm can be set according to the requirement, such as 100 dimensions;
s32: controlling the SDNE algorithm to learn the first-order neighbor relationships in the adjacency matrix in a supervised manner, and to learn the second-order neighbor relationships in the adjacency matrix with the unsupervised deep-learning technique AutoEncoder; optimizing the SDNE algorithm by combining the loss functions of the two learning processes, and finally extracting the embedding layer of the SDNE algorithm as the dense vectors of all cases in the case library 2;
in the step, the first order neighbor relation and the second order neighbor relation are combined and integrated in the learning process; through the first-order neighbor relation and the second-order neighbor relation, local characteristics and global characteristics of the network can be well captured.
The first-order neighbor relationship refers to the proximity of a local point pair, i.e. of two directly connected vertices; the second-order neighbor relationship refers to the similarity between the neighborhoods of a pair of vertices, representing how alike their neighborhood network structures are. Learning in a supervised manner uses the first-order neighbor relationship as supervision information to preserve the local structure of the network; the supervised part of the SDNE framework consists of multiple nonlinear mapping functions that map the adjacency matrix into a high-dimensional nonlinear hidden space to capture the network structure. The SDNE algorithm therefore learns the first-order neighbor relationships in the adjacency matrix in a supervised manner and obtains a first-order loss function L_1st. The second-order neighbor relationship refers to the similarity of the neighbors of two nodes, so modeling second-order similarity requires the neighborhood of each node. The part of the SDNE model learned with the unsupervised deep-learning technique comprises an autoencoder and a decoder: the encoder consists of multiple nonlinear functions that map the adjacency matrix into a representation space, and correspondingly the decoder consists of multiple nonlinear functions that map the representation space back to a reconstruction space of the adjacency matrix. The SDNE algorithm therefore learns the second-order neighbor relationships in the adjacency matrix using the unsupervised deep-learning technique AutoEncoder and obtains a second-order loss function L_2nd.
The loss functions of the two learning processes are combined into a joint optimization loss function; minimizing the joint loss optimizes the SDNE algorithm. The joint optimization loss function is:
L_mix = L_2nd + α·L_1st + ν·L_reg
where L_reg is a regularization term, α is a parameter controlling the first-order loss, and ν is a parameter controlling the regularization term;
and extracting an optimized Embedding layer (Embedding) in the SDNE algorithm to be used as a dense vector of all cases in the case library 2.
For example, if the case library 2 contains 100,000 cases, a 100,000 × 100,000 adjacency matrix is generated; the adjacency matrix is fed into the SDNE algorithm for learning, the supervised and unsupervised learning processes are combined to optimize the SDNE algorithm, and the embedding layer of the SDNE algorithm is finally extracted as the dense vectors, in this example a 100,000 × 100 matrix.
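The joint objective L_mix = L_2nd + α·L_1st + ν·L_reg can be written out directly. The sketch below assumes the embedding-layer output Y and the decoder reconstruction of the adjacency matrix are already available, and follows the common SDNE convention of weighting reconstruction errors on non-zero entries by a factor β > 1 (an assumption; the text does not state β):

```python
import numpy as np

def sdne_joint_loss(S, Y, A_hat, weights, alpha=0.2, nu=1e-4, beta=5.0):
    """Joint SDNE loss L_mix = L_2nd + alpha*L_1st + nu*L_reg.
    S: (n, n) adjacency matrix; Y: (n, d) embedding-layer output;
    A_hat: (n, n) decoder reconstruction of S; weights: network matrices."""
    # First-order loss: associated cases should have close embeddings,
    # L_1st = sum_ij S_ij * ||y_i - y_j||^2
    diff = Y[:, None, :] - Y[None, :, :]
    L_1st = float((S * (diff ** 2).sum(axis=2)).sum())
    # Second-order loss: reconstruction of the adjacency rows, with the
    # non-zero entries penalised by beta: L_2nd = ||(A_hat - S) * B||_F^2
    B = np.where(S > 0, beta, 1.0)
    L_2nd = float((((A_hat - S) * B) ** 2).sum())
    # Regularisation term: squared norm of all network weights
    L_reg = float(sum((W ** 2).sum() for W in weights))
    return L_2nd + alpha * L_1st + nu * L_reg
```

A perfect reconstruction with identical embeddings for associated cases drives the loss to zero, which is what the optimization pushes toward.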
Specifically, the step S4 includes the following steps:
s41: inputting the dense vectors into the AP (Affinity Propagation) clustering algorithm to obtain the Euclidean distance between any two cases in the case library 2 and form a similarity matrix;
the AP clustering algorithm is a common clustering algorithm; unlike K-means and similar algorithms, AP does not require the number of clusters to be fixed in advance, and the cluster centers it finds are points that actually exist in the data;
in this step, the dense vector is input into the AP clustering algorithm, and the AP clustering algorithm calculates the euclidean distance between each case through the dense vector as the similarity s (i, j) between any two cases, that is, the similarity between the ith case and the jth case is expressed through the distance s between the ith case and the jth case;
and calculating the similarity among all cases in the case library 2 according to the dense vector according to the method, and summarizing to form a similarity matrix.
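A sketch of step S41; note that affinity propagation conventionally scores similarity as the negative squared Euclidean distance, so that larger values mean "more similar" (the sign convention is an assumption; the text speaks only of the Euclidean distance):

```python
import numpy as np

def similarity_matrix(dense):
    """Pairwise similarity s(i, j) over the dense vectors, taken as the
    negative squared Euclidean distance between case i and case j."""
    sq = (dense ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * dense @ dense.T
    return -np.maximum(d2, 0.0)   # clamp tiny negatives from rounding
```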
S42: clustering the cases in the dense vector according to the similarity matrix, sequentially obtaining the degree of each case in the dense vector as the clustering center of other cases through an attraction information matrix of an AP clustering algorithm, and sequentially obtaining the degree of each case in the dense vector for selecting other cases as the clustering center of the case through an attribution information matrix of the AP clustering algorithm; ending and outputting a clustering result until the iteration times are reached or the samples in each clustering area in the dense vector are kept unchanged;
in this step, when s (i, j) > s (i, k), it indicates that the similarity between the sample i and the sample j is greater than the similarity between the sample i and the sample k;
the AP clustering algorithm has an attraction-information (responsibility) matrix R, where the attraction information r(i, j) describes the degree to which case j is suited to serve as the cluster center of case i, and represents a message from i to j;
the AP clustering algorithm has an attribution-information (availability) matrix A, where the attribution information a(i, j) describes the degree to which case i is suited to select case j as its cluster center, and represents a message from j to i.
Further, the step S42 includes the following steps:
s42-1: setting the iteration number of the AP clustering algorithm as T, and initializing r (i, j) and a (i, j) to be 0;
s42-2, iterating the attraction information matrix r (i, j) in the AP clustering algorithm according to the following formula,
r(i, j) = s(i, j) − max_{j'≠j}{ a(i, j') + s(i, j') };
that is, the suitability of j as the center of i equals the similarity between i and j minus the strongest competing claim, namely the maximum over all other candidates j' of the sum of the attribution degree a(i, j') and the similarity s(i, j');
in this step, the suitability s(i, j) of j becoming the cluster center of i is recorded in the similarity matrix, so it only needs to be shown that j is more suitable than the other candidates; for every other case j', s(i, j') represents the suitability of case j' as the cluster center of case i;
a(i, j') is then defined to represent the attribution degree of i to case j';
adding the two values, a(i, j') + s(i, j') measures how suitable case j' is as the cluster center of case i;
among all other cases j', the largest sum max_{j'≠j}{ a(i, j') + s(i, j') } is found, and subtracting it from s(i, j) yields the attraction of j to i: r(i, j) = s(i, j) − max_{j'≠j}{ a(i, j') + s(i, j') };
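The attraction (responsibility) update of step S42-2 can be computed for all pairs at once; a sketch, with the customary damping factor added as an assumption (the text does not mention damping):

```python
import numpy as np

def update_responsibility(S, A, R_old=None, damping=0.5):
    """r(i, j) = s(i, j) - max_{j' != j}{ a(i, j') + s(i, j') }."""
    n = S.shape[0]
    AS = A + S
    idx = AS.argmax(axis=1)                 # best candidate j' per row
    first = AS[np.arange(n), idx]
    AS2 = AS.copy()
    AS2[np.arange(n), idx] = -np.inf
    second = AS2.max(axis=1)                # runner-up, used where j' == j
    max_excl = np.repeat(first[:, None], n, axis=1)
    max_excl[np.arange(n), idx] = second    # exclude j itself from the max
    R_new = S - max_excl
    if R_old is None:
        return R_new
    return damping * R_old + (1.0 - damping) * R_new  # damped update
```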
s42-3: iterating the attribution information a (i, j) in the AP clustering algorithm according to the following formula and conditions;
when i ≠ j: a(i, j) = min( 0, r(j, j) + Σ_{i'∉{i,j}} max{0, r(i', j)} );
when i = j: a(i, i) = Σ_{i'≠i} max{0, r(i', i)};
determining the cluster centers: if i = j, then i is a cluster center; otherwise j is the cluster center of i, where
j = argmax_j { a(i, j) + r(i, j) };
in this step, the attraction degree r(i', j) of case j to the other cases is calculated and accumulated to represent the overall attraction of case j: Σ max{0, r(i', j)}, to which r(j, j) is then added;
from the attraction formula it can be seen that r(j, j) reflects the extent to which case j is unsuited to be assigned to another cluster center, while a(j, j) mainly reflects the ability of j to act as a cluster center.
S42-4: iterating the AP clustering algorithm according to the steps S42-2 and S42-3 until the iteration times reach T or the samples in each clustering area remain unchanged, ending the AP clustering algorithm and outputting clustering results;
in this step, the AP clustering algorithm finally outputs a clustering result having a plurality of clusters, and cases corresponding to nodes in each cluster are regarded as the same class.
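In practice the whole of step S4 maps onto scikit-learn's `AffinityPropagation`, which implements the attraction/attribution message passing above; `max_iter` plays the role of the iteration count T. A sketch on assumed toy dense vectors with two obvious groups:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy stand-in for the dense vectors of six cases: two tight groups
dense = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

ap = AffinityPropagation(damping=0.5, max_iter=200, random_state=0)
labels = ap.fit_predict(dense)
# Cases sharing a label belong to the same cluster; the exemplars in
# ap.cluster_centers_ are actual points in the data, as noted above.
```

Cases whose nodes fall in the same cluster of `labels` are then treated as the same class.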
Specifically, the step S5 includes the following steps:
s51: obtaining clusters of which the number of cases is below a clustering threshold in a clustering result; extracting policy information, automobile characteristics and report characteristics of cases in the cluster;
preferably, the clustering threshold can be adjusted as needed.
Optionally, the clustering threshold may be 100.
In this step, the policy information comprises valid policies and lapsed policies;
the automobile characteristics comprise first-hand vehicles, second-hand vehicles and staged-accident ("porcelain-bumping") vehicles;
the reporter characteristics comprise normal reporters and blacklisted reporters;
further, the blacklist is used to store reporter information suspected of involvement in fraud; if a case is determined through investigation to be a fraud case, the reporter information of that case is recorded into the blacklist;
blacklisted reporters are reporters whose information appears in the blacklist, and normal reporters are those whose information has not been entered into the blacklist.
S52: acquiring the number A of invalid policy maintenance in the cluster according to the policy maintenance information;
s53: acquiring the quantity B of second-hand vehicles and the quantity C of porcelain-bumping vehicles of the cases in the cluster according to the automobile characteristics;
s54: acquiring the number D of blacklist report persons in the cluster according to the report person characteristics;
s55: calculating the number A of lapsed policies, the number B of second-hand vehicles, the number C of porcelain-bumping vehicles and the number D of blacklisted reporters according to a weighted-sum formula to obtain a risk value Y;
in this step, the weighted sum formula is:
Y = mA + nB + pC + qD, where m, n, p and q are natural numbers;
and m, n, p and q can be adjusted according to needs;
s56: if the risk value Y does not exceed a risk threshold, determining that the case in the cluster is a low risk case;
if the risk value Y exceeds a risk threshold value, determining that the case in the cluster is a high-risk case;
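Steps S52 to S56 reduce to a weighted sum and a threshold test; a minimal sketch with illustrative weights m, n, p, q and an assumed risk threshold, since the specification leaves all of these values adjustable:

```python
def risk_value(A, B, C, D, m=1, n=2, p=3, q=4):
    """Y = m*A + n*B + p*C + q*D over the cluster's counts of lapsed
    (invalid) policies (A), second-hand vehicles (B), porcelain-bumping
    vehicles (C) and blacklisted reporters (D). Weights are illustrative."""
    return m * A + n * B + p * C + q * D

def classify_cluster(A, B, C, D, risk_threshold=10):
    """Step S56: high-risk if the risk value Y exceeds the risk threshold."""
    return "high" if risk_value(A, B, C, D) > risk_threshold else "low"
```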
s57: generating a high-risk signal according to the high-risk case, and outputting the high-risk signal to a client in a communication signal form; or
Generating a low-risk signal according to a low-risk case, and outputting the low-risk signal to a client in the form of a communication signal;
in the step, a high-risk case is converted into an information source and input into sending equipment, the sending equipment converts the information source into an analog signal or a digital signal as a communication signal, and the communication signal is output to a client through a channel;
converting the low-risk case into an information source and inputting the information source into sending equipment, converting the information source into an analog signal or a digital signal as a communication signal by the sending equipment, and outputting the communication signal to a client through a channel;
the information source is an electric signal, so that the data information of the high-risk case or the data information of the low-risk case is converted into the electric signal and is input to the sending equipment;
the sending equipment can be an analog communication system or a digital communication system; the analog communication system is used for converting the electric signal into an analog signal through the modulator and outputting the analog signal to the client through a channel; the digital communication system is used for carrying out compression coding, encryption coding, channel coding and digital modulation operation on the electric signal, converting the electric signal into a digital signal and outputting the digital signal to a client through a channel;
the channel is the physical medium that carries the communication signal from the sending device to the client, and is divided into two categories: wired channels and wireless channels; for example, a mobile communication channel (such as 2G, 3G, 4G or WiMAX) or a wired communication channel (such as a Digital Subscriber Line (DSL) or Power Line Communication (PLC)) may be used.
Example two
Referring to fig. 3, the data cluster identification apparatus 1 of the present embodiment includes:
a one-dimensional vector formulation module 11, configured to set one case in the case library 2 as a reference case and set the other cases in the case library 2 as comparison cases; to extract the reference case and the comparison cases from the case library 2 and judge in turn whether the reference case has an association relationship with each comparison case; if so, to assign the relation value between the reference case and that comparison case as 1, and if not, as 0; and to form the one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; the data information of a case comprises a field picture, case-report text information, field structure information, policy information, automobile characteristics and reporter characteristics;
the adjacency matrix formulation module 12 is configured to invoke the one-dimensional vector formulation module 11 to sequentially set cases in the case library 2 as reference cases, sequentially obtain one-dimensional vectors of the reference cases, and combine the one-dimensional vectors of all the cases in the case library 2 to obtain an adjacency matrix;
a vector operation module 13, configured to calculate the adjacency matrix by using an SDNE algorithm to obtain dense vectors of all cases in the case library 2;
the clustering operation module 14 is configured to calculate the dense vector by using an AP clustering algorithm, so as to cluster all cases in the case library 2, and output a clustering result;
the risk evaluation module 15 is used for calculating a risk value of a case according to the policy information, the automobile characteristic and the report characteristic of the case in the clustering result, and judging whether the case in the cluster of the clustering result is a high-risk case or a low-risk case according to the risk value; generating a high-risk signal according to the high-risk case and outputting the high-risk signal to the client 4, or generating a low-risk signal according to the low-risk case and outputting the low-risk signal to the client 4; wherein the high-risk signal and the low-risk signal are to be output to the client 4 in the form of communication signals, respectively.
This technical solution is based on artificial intelligence: using intelligent decision technology, a classification model is established by the one-dimensional vector formulation module, the adjacency matrix formulation module, the vector operation module and the clustering operation module, and the dense vectors are clustered by the clustering algorithm to obtain a clustering result that groups all cases in the case library; the risk evaluation module then obtains the high-risk and low-risk cases, realizing the identification of high-risk cases.
Example three:
In order to achieve the above object, the present invention further provides a computer system comprising a plurality of computer devices 3; the components of the data cluster identification apparatus 1 of the second embodiment may be distributed across different computer devices, which may be smartphones, tablet computers, notebook computers, desktop computers, rack servers, blade servers, tower servers or cabinet servers (including independent servers or a server cluster composed of multiple servers) that execute programs, and the like. The computer device of this embodiment at least includes, but is not limited to: a memory 31 and a processor 32, which may be communicatively coupled to each other via a system bus, as shown in FIG. 4. It should be noted that FIG. 4 only shows a computer device with these components, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead.
In the present embodiment, the memory 31 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the storage 31 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory 31 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device. Of course, the memory 31 may also include both internal and external storage devices of the computer device. In this embodiment, the memory 31 is generally used to store an operating system and various application software installed on the computer device, such as the program code of the data cluster identification apparatus in the first embodiment. Further, the memory 31 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 32 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 32 is typically used to control the overall operation of the computer device. In this embodiment, the processor 32 is configured to run the program code stored in the memory 31 or process data, for example, run a data cluster recognition device, so as to implement the data cluster recognition method of the first embodiment.
Example four:
To achieve the above objects, the present invention also provides a computer-readable storage medium, which includes a plurality of storage media such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store and the like, on which a computer program is stored that implements the corresponding functions when executed by the processor 32. The computer-readable storage medium of this embodiment is used for storing the data cluster identification apparatus, which, when executed by the processor 32, implements the data cluster identification method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. The data cluster identification method is characterized by comprising the following steps:
s1, setting one case in the case library as a reference case and setting the other cases in the case library as comparison cases; extracting the reference case and the comparison cases from the case library, and judging in turn whether the reference case has an association relationship with each comparison case; if so, assigning the relation value between the reference case and that comparison case as 1, and if not, as 0; forming the one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; the data information of a case comprises a field picture, case-report text information, field structure information, policy information, automobile characteristics and reporter characteristics;
s2: sequentially setting the cases in the case library as reference cases according to the method of S1, sequentially obtaining one-dimensional vectors of the reference cases, and combining the one-dimensional vectors of all the cases in the case library to obtain an adjacency matrix;
s3: calculating the adjacency matrix by using an SDNE algorithm to obtain a dense vector for expressing the incidence relation among all cases in the case library;
s4: calculating the dense vector by using an AP clustering algorithm to cluster all cases in the case library and output a clustering result;
s5: calculating a risk value of a case according to policy information, automobile characteristics and report characteristics of the case in the clustering result, and judging whether the case in the cluster of the clustering result is a high-risk case or a low-risk case according to the risk value; generating a high-risk signal according to the high-risk case and outputting the high-risk signal to a client, or generating a low-risk signal according to the low-risk case and outputting the low-risk signal to the client;
wherein the high risk signal and the low risk signal are to be output to the client in the form of communication signals, respectively.
2. The data cluster identification method according to claim 1, wherein said step S1 includes the steps of:
s11: extracting one case in the case library as a reference case, and setting the cases in the case library other than the reference case as comparison cases; judging the association relationship between the reference case and a comparison case;
s12: sequentially judging the incidence relation between the reference case and all the comparison cases in the case library according to the method of S11;
s13: if the reference case and the comparison case have an incidence relation, assigning a relation value between the reference case and the comparison case to be 1; if the reference case and the comparison case do not have the incidence relation, assigning the relation value between the reference case and the comparison case to be 0;
s14: combining the relation values between the reference case and each comparison case into a one-dimensional vector; the element value of each column in the one-dimensional vector is the relation value between the reference case and the corresponding comparison case.
3. The data cluster identification method according to claim 1, wherein in step S2, all cases in the case library are used as reference cases in turn; according to the method of S1, the association relationship between each comparison case and the current reference case is obtained in turn, yielding the one-dimensional vector of each reference case; and the one-dimensional vectors of the reference cases are combined to obtain an adjacency matrix.
4. The data cluster identification method according to claim 1, wherein said step S3 includes the steps of:
s31: the SDNE algorithm has an embedded layer dimension, and the adjacency matrix is recorded into an input layer of the SDNE algorithm;
s32: controlling the SDNE algorithm to learn the first-order neighbor relation in the adjacent matrix in a supervised mode, and learning the second-order neighbor relation in the adjacent matrix in an unsupervised deep learning mode; and optimizing the SDNE algorithm by combining the loss functions of the two learning processes, and finally extracting an embedded layer in the SDNE algorithm as a dense vector of all cases in the case library.
5. The data cluster identification method according to claim 1, wherein said step S4 includes the steps of:
s41: inputting the dense vector into an AP clustering algorithm to obtain Euclidean distances between any cases in the case library and obtain a similarity matrix;
s42: clustering the cases in the dense vectors according to the similarity matrix: through the attraction-information matrix of the AP clustering algorithm, obtaining in turn the degree to which each case is suited to serve as the cluster center of other cases, and through the attribution-information matrix of the AP clustering algorithm, obtaining in turn the degree to which each case is suited to select other cases as its cluster center; and ending and outputting the clustering result when the iteration count is reached or the samples in each clustering region of the dense vectors remain unchanged.
6. The data cluster identification method of claim 5, wherein the step S42 includes the steps of:
s42-1: setting the iteration number of the AP clustering algorithm as T, and initializing r (i, j) and a (i, j) to be 0;
wherein r(i, j) is the attraction information in the attraction-information matrix and describes the degree to which case j in the dense vectors is suited to serve as the cluster center of case i;
a(i, j) is the attribution information in the attribution-information matrix and describes the degree to which case i in the dense vectors is suited to select case j as its cluster center;
s42-2, iterating the attraction information matrix r (i, j) in the AP clustering algorithm according to the following formula;
r(i, j) = s(i, j) − max_{j'≠j}{ a(i, j') + s(i, j') };
wherein s(i, j) is the Euclidean distance between case i and case j in the dense vectors, i.e. the similarity between case i and case j;
this formula describes the suitability of j as the center of i: the similarity between i and j minus the maximum, over all other candidates j', of the sum of the attribution degree a(i, j') and the similarity s(i, j');
s42-3: iterating the attribution information a (i, j) in the AP clustering algorithm according to the following formula and conditions;
when i ≠ j: a(i, j) = min( 0, r(j, j) + Σ_{i'∉{i,j}} max{0, r(i', j)} );
when i = j: a(i, i) = Σ_{i'≠i} max{0, r(i', i)};
wherein the formula is used to determine the cluster centers: if i = j, i is a cluster center; otherwise j is the cluster center of i;
S42-4: iterating the AP clustering algorithm according to steps S42-2 and S42-3 until the number of iterations reaches T or the samples in each cluster region remain unchanged, then terminating the AP clustering algorithm and outputting the clustering result.
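The iteration in steps S42-1 to S42-4 can be sketched as follows. This is a minimal, illustrative implementation, not the patented one: it assumes the standard affinity-propagation convention of negative squared Euclidean distances as similarities (with preference values on the diagonal), and it adds a damping factor and a stability counter, which are common stabilisation measures not mentioned in the claims.

```python
import numpy as np

def ap_cluster(S, T=200, damping=0.5):
    """Affinity-propagation sketch of steps S42-1 to S42-4.

    S is an n x n similarity matrix s(i, j); the diagonal holds the
    preference values. Returns a label (exemplar index) per case.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # attraction (responsibility) matrix r(i, j)
    A = np.zeros((n, n))  # attribution (availability) matrix a(i, j)
    prev, stable = None, 0
    for _ in range(T):  # S42-1: at most T iterations
        # S42-2: r(i, j) = s(i, j) - max_{j' != j} {a(i, j') + s(i, j')}
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx].copy()
        AS[np.arange(n), idx] = -np.inf           # mask the row maximum
        second = AS.max(axis=1)                   # second-largest per row
        max_excl = np.repeat(first[:, None], n, axis=1)
        max_excl[np.arange(n), idx] = second      # max excluding column j
        R = damping * R + (1 - damping) * (S - max_excl)

        # S42-3: a(i, j) = min(0, r(j,j) + sum_{i' not in {i,j}} max(0, r(i',j)))
        #        a(j, j) = sum_{i' != j} max(0, r(i', j))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col = Rp.sum(axis=0)                      # r(j,j) + sum of positive r(i',j)
        A_new = col[None, :] - Rp                 # remove the i' = i term
        diag = A_new.diagonal().copy()
        A_new = np.minimum(0, A_new)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

        # S42-4: stop once cluster assignments have stopped changing
        labels = np.argmax(A + R, axis=1)
        if prev is not None and np.array_equal(labels, prev):
            stable += 1
            if stable >= 10:
                break
        else:
            stable = 0
        prev = labels
    return np.argmax(A + R, axis=1)
```

With the median of the off-diagonal similarities as the preference, two well-separated groups of points are assigned two distinct exemplars.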
7. The data cluster identification method according to claim 1, wherein said step S5 includes the steps of:
S51: obtaining the clusters in the clustering result whose number of cases is below a clustering threshold, and extracting the policy information, automobile characteristics, and reporter characteristics of the cases in each such cluster;
S52: acquiring the number A of invalid policies in the cluster according to the policy information;
S53: acquiring the number B of second-hand vehicles and the number C of staged-accident ("porcelain-bumping") vehicles among the cases in the cluster according to the automobile characteristics;
S54: acquiring the number D of blacklisted reporters in the cluster according to the reporter characteristics;
S55: calculating a risk value Y from the number A of invalid policies, the number B of second-hand vehicles, the number C of staged-accident vehicles, and the number D of blacklisted reporters according to a weighted-sum formula;
S56: if the risk value Y does not exceed a risk threshold, determining that the cases in the cluster are low-risk cases;
if the risk value Y exceeds the risk threshold, determining that the cases in the cluster are high-risk cases;
S57: generating a high-risk signal according to the high-risk cases and outputting it to a client in the form of a communication signal; or
generating a low-risk signal according to the low-risk cases and outputting it to the client in the form of a communication signal.
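The weighted-sum scoring of steps S52 to S56 can be sketched as below. The weights and the risk threshold are hypothetical placeholders, since the claims specify only that a weighted-sum formula and a threshold comparison are used, not their values.

```python
def risk_value(a, b, c, d, weights=(0.4, 0.2, 0.3, 0.1)):
    """Weighted sum of the four risk counts from steps S52-S54:
    a invalid policies, b second-hand vehicles, c staged-accident
    vehicles, d blacklisted reporters. Weights are illustrative."""
    wa, wb, wc, wd = weights
    return wa * a + wb * b + wc * c + wd * d

def classify_cluster(a, b, c, d, risk_threshold=2.0):
    """Step S56: the cluster's cases are high-risk if the risk
    value Y exceeds the (hypothetical) threshold, else low-risk."""
    y = risk_value(a, b, c, d)
    return "high-risk" if y > risk_threshold else "low-risk"
```

For example, a small cluster with 5 invalid policies, 3 second-hand vehicles, 4 staged-accident vehicles, and 2 blacklisted reporters scores Y = 4.0 under these placeholder weights and would be flagged high-risk.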
8. A data cluster identification apparatus, comprising:
a one-dimensional vector formulation module, used for setting one case in a case library as a reference case and setting the other cases in the case library as comparison cases; extracting the reference case and the comparison cases from the case library and judging in turn whether the reference case has an association relationship with each comparison case; if so, assigning 1 to the relation value between the reference case and that comparison case, and if not, assigning 0; and forming a one-dimensional vector of the reference case from the relation values between the reference case and each comparison case; wherein the data information of a case comprises scene pictures, case-report text information, scene structured information, policy information, automobile characteristics, and reporter characteristics;
an adjacency matrix formulation module, used for calling the one-dimensional vector formulation module to set the cases in the case library as reference cases in turn, obtaining the one-dimensional vectors of all reference cases in turn, and merging the one-dimensional vectors of all cases in the case library into an adjacency matrix;
a vector operation module, used for performing computation on the adjacency matrix using the SDNE algorithm to obtain dense vectors of all cases in the case library;
a clustering operation module, used for performing computation on the dense vectors using the AP clustering algorithm, so as to cluster all cases in the case library and output a clustering result;
a risk evaluation module, used for calculating the risk value of the cases according to the policy information, automobile characteristics, and reporter characteristics of the cases in the clustering result, and judging the cases in a cluster of the clustering result as high-risk or low-risk cases according to the risk value; and generating a high-risk signal according to the high-risk cases and outputting it to a client, or generating a low-risk signal according to the low-risk cases and outputting it to the client; wherein the high-risk signal and the low-risk signal are each output to the client in the form of a communication signal.
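As a rough sketch of the one-dimensional-vector and adjacency-matrix modules: each case's relation values against every other case form one row, and stacking the rows yields the adjacency matrix consumed by the SDNE step. The `related` predicate here is a hypothetical stand-in for the claims' association test over scene pictures, report text, policy information, and the other case data.

```python
import numpy as np

def build_adjacency(cases, related):
    """Entry (i, j) is 1 if case i has an association relationship
    with case j, else 0; each case serves in turn as the reference
    case and is compared against every other case in the library."""
    n = len(cases)
    adj = np.zeros((n, n), dtype=int)
    for i, ref in enumerate(cases):          # reference case
        for j, other in enumerate(cases):    # comparison cases
            if i != j and related(ref, other):
                adj[i, j] = 1
    return adj
```

For instance, with a toy predicate that relates cases sharing a licence plate, two cases with plate "A" and one with plate "B" yield a matrix whose only 1-entries link the two "A" cases.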
9. A computer system comprising a plurality of computer devices, each computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processors of the plurality of computer devices when executing the computer program collectively implement the steps of the data cluster identification method of any one of claims 1 to 7.
10. A computer-readable storage medium comprising a plurality of storage media, each storage medium having a computer program stored thereon, wherein the computer programs stored in the storage media, when executed by a processor, collectively implement the steps of the data cluster identification method of any one of claims 1 to 7.
CN201910754337.4A 2019-08-15 2019-08-15 Data cluster recognition method, device, computer system and readable storage medium Active CN110659997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910754337.4A CN110659997B (en) 2019-08-15 2019-08-15 Data cluster recognition method, device, computer system and readable storage medium

Publications (2)

Publication Number Publication Date
CN110659997A true CN110659997A (en) 2020-01-07
CN110659997B CN110659997B (en) 2023-06-27

Family

ID=69037613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910754337.4A Active CN110659997B (en) 2019-08-15 2019-08-15 Data cluster recognition method, device, computer system and readable storage medium

Country Status (1)

Country Link
CN (1) CN110659997B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019061992A1 (en) * 2017-09-30 2019-04-04 平安科技(深圳)有限公司 Method for optimizing investigation grid, electronic device, and computer readable storage medium
CN109784636A (en) * 2018-12-13 2019-05-21 中国平安财产保险股份有限公司 Fraudulent user recognition methods, device, computer equipment and storage medium
CN109919781A (en) * 2019-01-24 2019-06-21 平安科技(深圳)有限公司 Case recognition methods, electronic device and computer readable storage medium are cheated by clique
CN109992578A (en) * 2019-01-07 2019-07-09 平安科技(深圳)有限公司 Anti- fraud method, apparatus, computer equipment and storage medium based on unsupervised learning

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709022A (en) * 2020-06-16 2020-09-25 桂林电子科技大学 Hybrid alarm association method based on AP clustering and causal relationship
CN113420561A (en) * 2021-07-14 2021-09-21 上海浦东发展银行股份有限公司 Named entity identification method, device, equipment and storage medium
CN113420561B (en) * 2021-07-14 2022-12-13 上海浦东发展银行股份有限公司 Named entity identification method, device, equipment and storage medium
CN117216478A (en) * 2023-09-12 2023-12-12 杭州融易算智能科技有限公司 Financial data batch processing method
CN117216478B (en) * 2023-09-12 2024-04-30 杭州融易算智能科技有限公司 Financial data batch processing method


Similar Documents

Publication Publication Date Title
US10692218B2 (en) Method and system of detecting image tampering, electronic device and storage medium
CN110009475B (en) Risk auditing and monitoring method and device, computer equipment and storage medium
CN109284371B (en) Anti-fraud method, electronic device, and computer-readable storage medium
CN109284372B (en) User operation behavior analysis method, electronic device and computer readable storage medium
CN112863683B (en) Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium
CN110659997A (en) Data cluster identification method and device, computer system and readable storage medium
CN116305168B (en) Multi-dimensional information security risk assessment method, system and storage medium
CN112200081A (en) Abnormal behavior identification method and device, electronic equipment and storage medium
CN111695604A (en) Image reliability determination method and device, electronic equipment and storage medium
CN111639607A (en) Model training method, image recognition method, model training device, image recognition device, electronic equipment and storage medium
CN112632609A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium
CN115392937A (en) User fraud risk identification method and device, electronic equipment and storage medium
CN114817933A (en) Method and device for evaluating robustness of business prediction model and computing equipment
CN113553577B (en) Unknown user malicious behavior detection method and system based on hypersphere variational automatic encoder
CN112948897B (en) Webpage tamper-proofing detection method based on combination of DRAE and SVM
CN112100617A (en) Abnormal SQL detection method and device
CN114970694B (en) Network security situation assessment method and model training method thereof
CN115567224A (en) Method for detecting abnormal transaction of block chain and related product
CN115622793A (en) Attack type identification method and device, electronic equipment and storage medium
CN114742572A (en) Abnormal flow identification method and device, storage medium and electronic device
CN115984886A (en) Table information extraction method, device, equipment and storage medium
CN111931870B (en) Model prediction method, model prediction device and system based on model multiplexing
US11727109B2 (en) Identifying adversarial attacks with advanced subset scanning
CN110717521B (en) Intelligent service implementation method and device and computer readable storage medium
CN113743293A (en) Fall behavior detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant