CN110852893A

CN110852893A - Risk identification method, system, equipment and storage medium based on mass data

Info

Publication number: CN110852893A
Application number: CN201910969841.6A
Authority: CN
Inventors: 唐文
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-02-28

Abstract

The embodiment of the invention provides a risk identification method based on mass data, which comprises the following steps: acquiring target user data of a target user; searching target historical accident data of the target user according to the target user data; analyzing a target community to which the target user data and the target historical accident data belong through a pre-established relationship network; acquiring target weight coefficients of the target user data and the target historical accident data according to the target community; calculating based on the target weight coefficient to obtain a risk coefficient of the target user; and acquiring a corresponding risk grade according to the risk coefficient. The embodiment of the invention also provides a risk identification system, equipment and a storage medium based on mass data, and the embodiment of the invention has the beneficial effects that: the risk coefficient of the target user data of the target user and the target historical accident data can be effectively obtained according to the pre-established relationship network, and the operation efficiency is improved.

Description

Risk identification method, system, equipment and storage medium based on mass data

Technical Field

The embodiment of the invention relates to the field of data processing, in particular to a risk identification method, a risk identification system, risk identification equipment and a storage medium based on mass data.

Background Art

In the insurance industry, criminals often violate insurance regulations and adopt methods such as manufacturing insurance accidents to cheat insurance funds from insurance companies. Insurance companies need to identify insurance fraud by analyzing policy information and user information.

At present, risk analysis of fraud events in the insurance industry is typically performed on a social network. The traditional risk scoring method in such analysis relies on rule scoring, in which both the rules and the score assigned to each rule are derived from business experience. After the user data is woven into a network, the network triggers a certain number of rules, and the total score of the triggered rules is used to evaluate the user's fraud risk. The existing method has the following defects: the rule features are simple, so the identification error is large when a large amount of data is involved; and once the data accumulates to a certain volume, the computer equipment runs slowly and the calculation is time-consuming.

Therefore, it is necessary to provide an efficient big data processing method to improve the processing efficiency and mining accuracy of data mining.

Disclosure of Invention

In view of this, an object of the embodiments of the present invention is to provide a risk identification method, system, device and storage medium based on mass data, which can effectively obtain risk coefficients of target user data and target historical accident data of a target user according to a pre-established relationship network, and improve operation efficiency.

In order to achieve the above object, an embodiment of the present invention provides a risk identification method based on mass data, including:

acquiring target user data of a target user;

searching target historical accident data of the target user according to the target user data;

analyzing a target community to which the target user data and the target historical accident data belong through a pre-established relationship network;

acquiring target weight coefficients of the target user data and the target historical accident data according to the target community;

calculating based on the target weight coefficient to obtain a risk coefficient of the target user; and

generating a risk warning page or a risk control instruction according to the risk coefficient, wherein the risk warning page is sent to and presented in a target client, and the risk control instruction is used for controlling the processing state of a target account or a target order of the target user.

Further, the step of establishing the relationship network includes:

acquiring a user data set of a plurality of sample users according to a plurality of characteristic items, wherein the user data set comprises a plurality of characteristic data corresponding to the plurality of characteristic items of each sample user;

performing cluster analysis on a plurality of feature data of a plurality of sample users corresponding to each feature item, and analyzing to obtain a plurality of cluster centers corresponding to each feature item; each characteristic item of each sample user is correspondingly associated with a clustering center; and

analyzing whether each sample user is associated with the same plurality of clustering centers as other sample users, and if more than two target sample users are associated with the same plurality of clustering centers, mapping the more than two target sample users to a target community.

Further, the step of analyzing the target community to which the target user data and the target historical accident data belong through a pre-established relationship network includes:

performing clustering analysis on the target user data and the target historical accident data to obtain a clustering center of the target user data and the target historical accident data;

and determining a target community corresponding to a clustering center of the target user data and the target historical accident data according to a fuzzy clustering algorithm.

Further, the step of obtaining a target weight coefficient of each target user data and the target historical accident data according to the target community comprises:

acquiring a weight coefficient of a target community corresponding to the target personal data and the target historical accident data, wherein the weight coefficient of each community is preset;

calculating a similarity coefficient of a target community corresponding to the target personal data and the target historical accident data;

and calculating to obtain a target weight coefficient according to the weight coefficient and the similarity coefficient.

In order to achieve the above object, an embodiment of the present invention further provides a risk identification system based on mass data, including:

the first acquisition module is used for acquiring target user data of a target user;

the searching module is used for searching the target historical accident data of the target user according to the target user data;

the analysis module is used for analyzing a target community to which the target user data and the target historical accident data belong through a pre-established relationship network;

the second acquisition module is used for acquiring target weight coefficients of the target user data and the target historical accident data according to the target community;

the calculation module is used for calculating to obtain a risk coefficient of the target user based on the target weight coefficient;

and the generating module is used for generating a risk warning page or a risk control instruction according to the risk coefficient, wherein the risk warning page is used for sending and presenting to a target client, and the risk control instruction is used for controlling the processing state of a target account or a target order of the target user.

Further, the analysis module is further configured to:

analyzing whether each sample user is associated with the same plurality of cluster centers as other sample users, so as to map at least some of the plurality of sample users to communities in the relationship network; and if the analysis shows that more than two target sample users are associated with the same plurality of clustering centers, mapping the more than two target sample users to a target community.

Further, the analysis module is further configured to:

Further, the second obtaining module is further configured to:

acquiring the target personal data and the weight coefficient of a target community corresponding to the target historical accident data, wherein the weight coefficient of each community is preset;

In order to achieve the above object, an embodiment of the present invention further provides a computer device, where the computer device includes a memory and a processor, the memory stores a risk identification system based on mass data that is executable on the processor, and when the risk identification system based on mass data is executed by the processor, the steps of the risk identification method based on mass data described above are implemented.

To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the risk identification method based on mass data as described above.

According to the risk identification method, system, device and storage medium based on mass data provided by the embodiments of the invention, the target user data and the historical car insurance data of the target user are subjected to cluster analysis through the pre-established relationship network to obtain the target community, from which the target weight coefficients and the corresponding risk coefficient of the target user are then derived. Compared with the prior art, the embodiments can effectively obtain the risk coefficient for the target user data and the target historical accident data of the target user according to the pre-established relationship network, and improve operation efficiency.

Drawings

Fig. 1 is a flowchart of a risk identification method based on mass data according to a first embodiment of the present invention.

Fig. 2 is a flowchart of the steps for establishing the relationship network shown in fig. 1 according to the embodiment of the present invention.

Fig. 3 is a flowchart of step S104 in fig. 1 according to an embodiment of the present invention.

Fig. 4 is a flowchart of step S106 in fig. 1 according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of program modules of a risk identification system based on mass data according to a second embodiment of the present invention.

Fig. 6 is a schematic diagram of a hardware structure of a third embodiment of the computer apparatus according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

Referring to fig. 1, a flowchart illustrating steps of a risk identification method based on mass data according to a first embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by way of example with the computer device 2 as the execution subject. The details are as follows.

Step S100, target user data of a target user is obtained.

Specifically, the target user data includes target personal data of the target user, for example a target ID card number, a target license plate number, a target telephone number, a target name and a target address. The target personal data can be entered by the target user through a graphical user interface, or obtained from the target license plate number of the vehicle when the target user has an accident.

Step S102, searching target historical accident data of the target user according to the target user data.

Specifically, based on one or more of the target user data, target historical accident data of the target user is obtained from an internal database or a third-party database. The target historical accident data comprises, but is not limited to, target historical vehicle insurance accident data and target historical vehicle insurance application data. The target historical vehicle insurance accident data comprises: target historical accident occurrence location data, target historical claim amount data, target repair location data, target repair amount data, and the like. The target historical vehicle insurance application data comprises: target historical vehicle insurance application type data, target historical vehicle insurance amount data and the like.
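Steps S100 and S102 can be pictured as a keyed lookup. The following is a minimal sketch only: the record fields, the in-memory `ACCIDENT_DB` dictionary standing in for the internal or third-party database, and the function name `find_history` are all hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TargetUserData:
    # Any field may be missing; the text only requires "one or more" items.
    id_number: Optional[str] = None
    license_plate: Optional[str] = None
    phone: Optional[str] = None


@dataclass(frozen=True)
class AccidentRecord:
    location: str
    claim_amount: float
    repair_location: str


# Hypothetical stand-in for the internal or third-party database,
# indexed by (field name, field value).
ACCIDENT_DB: Dict[Tuple[str, str], List[AccidentRecord]] = {
    ("license_plate", "PLATE-001"): [AccidentRecord("County A", 3200.0, "Garage X")],
    ("phone", "555-0100"): [AccidentRecord("County B", 800.0, "Garage Y")],
}


def find_history(user: TargetUserData) -> List[AccidentRecord]:
    """Collect records matching any available identifying field, without duplicates."""
    found: List[AccidentRecord] = []
    for field in ("id_number", "license_plate", "phone"):
        value = getattr(user, field)
        if value is None:
            continue
        for record in ACCIDENT_DB.get((field, value), []):
            if record not in found:
                found.append(record)
    return found
```

Looking up by every available field, rather than by a single key, mirrors the statement that the search may be based on one or more items of the target user data.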

Step S104, analyzing a target community to which the target user data and the target historical accident data belong through a pre-established relationship network.

Specifically, the pre-established relationship network comprises a plurality of communities corresponding to a plurality of target sample users, and the communities are analyzed through a fuzzy clustering algorithm to determine target communities to which target user data and target historical accident data belong.

Illustratively, referring to fig. 2, the establishing step of the relationship network includes:

step S104A, acquiring a user data set of a plurality of sample users according to the plurality of feature items, where the user data set includes a plurality of feature data corresponding to the plurality of feature items of each sample user.

Specifically, the plurality of feature items cover sample car insurance data and sample user data of the sample users, for example address data, native place data, accident occurrence time data, accident vehicle related data, repair location data and the like. Taking the accident occurrence location as an example, the feature data may be a certain urban district, a certain county, a certain road segment, and the like.

Step S104B, performing cluster analysis on a plurality of characteristic data of a plurality of sample users corresponding to each characteristic item, and analyzing to obtain a plurality of cluster centers corresponding to each characteristic item; and each characteristic item of each sample user is correspondingly associated with one clustering center.

Specifically, a Word2Vec model is trained first. The training corpus comes from the user data set and must be preprocessed by word segmentation, part-of-speech tagging and word combination. Training uses the skip-gram model with a training window size of 8, a sampling threshold of 1e-4 (1 × 10^-4) and a minimum frequency of 5; if a word or word combination occurs in the text fewer times than this minimum frequency, it is discarded. Very common words, for example "are", occur too often to be distinctive. The result is a Word2Vec model for the target field.

The user data set is then fed into the Word2Vec model to obtain word vectors of the user data set. The following takes cluster analysis based on the "accident occurrence location data" as an example to describe the steps of the cluster analysis:

step 1, randomly extracting n accident occurrence location data items from the plurality of accident occurrence location data in the user data set as cluster centers;

step 2, iterating: calculating the distance from each accident occurrence location data item other than the n cluster centers to each of the n cluster centers, and assigning each item to the class of its nearest cluster center;

the distance may be, for example, the Euclidean distance, the squared Euclidean distance, the Manhattan (block) distance, the Chebyshev distance, the chi-square measure, and the like;

step 3, updating the cluster centers: after each iteration, a new cluster center may be calculated by averaging, that is, by taking the mean of the accident occurrence location data assigned to each of the n cluster centers;

step 4, judging whether the cluster centers have changed after the iteration, and if so, repeating steps 2 to 4 with the new cluster centers generated in step 3 until the cluster centers no longer change.

The clustering center does not change, indicating that the clustering target has been reached, and thus the iteration operation can be stopped.
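The four steps above follow the familiar k-means pattern. Below is a minimal sketch, assuming each accident-location item has already been turned into a numeric vector (for instance by the Word2Vec model) and assuming Euclidean distance. The function name `k_means`, the seeded random generator and the `max_iter` safety cap are illustrative additions; for simplicity the sketch also re-assigns the items that were picked as initial centers.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def k_means(points: Sequence[Vector], n: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[Vector], List[int]]:
    rng = random.Random(seed)
    # Step 1: randomly pick n data items as the initial cluster centers.
    centers = rng.sample(list(points), n)
    labels: List[int] = []
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        labels = [min(range(n), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Step 3: recompute each center as the mean of the items assigned to it.
        new_centers = []
        for c in range(n):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Swapping `math.dist` for another metric from the list above (Manhattan, Chebyshev and so on) changes only the assignment in step 2.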

Step S104C, it is analyzed whether each sample user is associated with the same plurality of cluster centers as other sample users, and if the analysis results that more than two target sample users are associated with the same plurality of cluster centers, the more than two target sample users are mapped to one target community.

Specifically, the cluster analysis is performed along several different dimensions, such as the accident occurrence location and the native place of the person involved in the accident. Each dimension therefore has several cluster centers, for example County A, County B and County C for the native-place dimension. If two target sample users both have County A as native place, region N as accident location and period P as accident time, this indicates that the two target sample users are suspected of group fraud, and they belong to the same target community.
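Step S104C amounts to grouping sample users by the combination of cluster centers they are associated with. Below is a minimal sketch, assuming each user's assignments are already available as a mapping from feature item to cluster-center label. The name `map_communities`, the feature keys and the user IDs are hypothetical. Two simplifications are assumptions of the sketch rather than statements of the source: users must match on every feature item, and the minimum community size defaults to two, in line with the two-user example in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CommunityKey = Tuple[Tuple[str, str], ...]


def map_communities(assignments: Dict[str, Dict[str, str]],
                    min_size: int = 2) -> Dict[CommunityKey, List[str]]:
    """Group users whose cluster centers coincide on every feature item."""
    groups: Dict[CommunityKey, List[str]] = defaultdict(list)
    for user_id, centers in assignments.items():
        # The sorted (feature item, cluster center) pairs identify a community.
        key = tuple(sorted(centers.items()))
        groups[key].append(user_id)
    # Only combinations shared by at least min_size users form a community.
    return {key: users for key, users in groups.items() if len(users) >= min_size}


assignments = {
    "u1": {"native_place": "County A", "accident_location": "Region N",
           "accident_time": "Period P"},
    "u2": {"native_place": "County A", "accident_location": "Region N",
           "accident_time": "Period P"},
    "u3": {"native_place": "County B", "accident_location": "Region N",
           "accident_time": "Period Q"},
}
communities = map_communities(assignments)
```

Here `u1` and `u2` end up in one community, while `u3` shares only the accident location with them and is left unmapped.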

Exemplarily, referring to fig. 3, step S104 further includes:

step S104E, performing cluster analysis on the target user data and the target historical accident data to obtain a cluster center of the target user data and the target historical accident data.

Step S104F, determining a target community corresponding to the clustering center of the target user data and the target historical accident data according to a fuzzy clustering algorithm.

Specifically, a fuzzy clustering algorithm is used to calculate whether the target user falls within a first preset range of one of the cluster centers. The "first preset range" is variable and changes with the weight coefficient of the cluster center. For example, if the cluster center of the target historical accident location is a certain county, the first preset range may be the area within 50 kilometers of that county.

It should be noted that some of the target user data, such as a mobile phone number, may have been left blank or filled in incompletely by the target user. In that case, whether the target user falls within the preset range of one of the cluster centers can still be calculated from the data currently available by adjusting the weight coefficient of the cluster center.
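The embodiment does not spell out the fuzzy clustering computation. One common choice is the fuzzy c-means membership degree, sketched below under that assumption. The fuzzifier `m = 2`, the membership threshold and the function names are illustrative and not taken from the source.

```python
import math
from typing import List, Optional, Sequence


def fcm_memberships(point: Sequence[float],
                    centers: Sequence[Sequence[float]],
                    m: float = 2.0) -> List[float]:
    """Fuzzy c-means membership of one point in each cluster (sums to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
            for d_i in dists]


def pick_community(point: Sequence[float],
                   centers: Sequence[Sequence[float]],
                   threshold: float = 0.6) -> Optional[int]:
    """Index of the best-matching center, or None if the membership is too low."""
    u = fcm_memberships(point, centers)
    best = max(range(len(u)), key=u.__getitem__)
    return best if u[best] >= threshold else None
```

Lowering `threshold` plays a role similar to widening the preset range: a user with incomplete data can still be matched to the nearest community, at the cost of a weaker match.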

Step S106, acquiring target weight coefficients of the target user data and the target historical accident data according to the target community.

Specifically, after the target community of the target personal data and the target historical accident data of the target user is obtained, the target weight coefficients of the target personal data and the target historical accident data are calculated according to the weight coefficients of the target community corresponding to the target personal data and the target historical accident data, and all the target weight coefficients are added to obtain a total weight coefficient.

Exemplarily, referring to fig. 4, step S106 further includes:

step S106A, obtaining a weighting factor of the target community corresponding to the target personal data and the target historical accident data, where the weighting factor of each community is preset.

Specifically, the weight coefficient is adjustable: if the number of sample users associated with a certain target community is found to be too large, the weight coefficient of that target community may be increased, which in turn adjusts the weight coefficient applied to the corresponding target personal data and target historical accident data.

Step S106B, calculating a similarity coefficient between the target personal data and the target historical accident data and the corresponding target community.

Specifically, the cosine similarity can be used to calculate the similarity coefficient between the target personal data and the target historical accident data and the corresponding target community.

Step S106C, a target weight coefficient is calculated according to the weight coefficient and the similarity coefficient.

Specifically, a target weight coefficient is obtained by multiplying a weight coefficient corresponding to the target personal data and the target historical accident data by the similarity coefficient.
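Steps S106A to S106C reduce to one cosine similarity and one multiplication per data item. Below is a minimal sketch, assuming the data item and the cluster center of its community are both available as numeric vectors. The names `cosine_similarity` and `target_weight`, and the zero-vector convention, are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def target_weight(community_weight: float,
                  item_vector: Sequence[float],
                  center_vector: Sequence[float]) -> float:
    # Step S106C: preset community weight multiplied by the similarity coefficient.
    return community_weight * cosine_similarity(item_vector, center_vector)
```

An item that points in the same direction as its community center keeps the full preset weight, while an unrelated item contributes close to nothing.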

Step S108, calculating based on the target weight coefficient to obtain a risk coefficient of the target user.

Specifically, all target weight coefficients obtained from the target personal data and the target historical accident data are added to obtain the risk coefficient of the target user.
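Combining steps S106C and S108, the computation can be written compactly as follows. The symbols are introduced here for illustration only: $w_i$ is the preset weight coefficient of the community matched by the $i$-th data item, $s_i$ is the corresponding similarity coefficient, and $k$ is the number of data items.

```latex
R = \sum_{i=1}^{k} w_i \, s_i
```

The source does not state how the preset weights are normalized; if the risk coefficient $R$ is meant to stay at or below 1, the weights would need to be chosen accordingly.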

Step S110, generating a risk warning page or a risk control instruction according to the risk coefficient, where the risk warning page is used to send and present in a target client, and the risk control instruction is used to control a target account of the target user or a processing state of a target order.

Specifically, risk levels are preset and the corresponding risk level is obtained from the risk coefficient. For example, a risk coefficient in (1, 0.8) is the high risk level, (0.8, 0.6) the medium-high risk level, (0.6, 0.4) the medium-low risk level, (0.4, 0.2) the low risk level, and (0.2, 0) the small risk level. Each risk level can be given a corresponding processing scheme; a risk warning page is generated from the risk coefficient, the risk level and the processing scheme, and is sent to the target client. Alternatively, when the risk coefficient is greater than a certain threshold, a risk control instruction is sent to control the processing state of the target account or the target order of the target user, the processing state being suspended or returned.
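The mapping from risk coefficient to risk level, and the threshold-based control instruction, could look like the sketch below. Which band a boundary value such as 0.8 falls into is not stated in the source; the sketch assigns it to the lower band. The level names, the `CONTROL_THRESHOLD` value and the `decide` function are assumptions.

```python
from typing import Dict

# Lower bounds (exclusive) of each band, highest first.
RISK_BANDS = [
    (0.8, "high"),
    (0.6, "medium-high"),
    (0.4, "medium-low"),
    (0.2, "low"),
]
CONTROL_THRESHOLD = 0.8  # hypothetical cut-off for issuing a control instruction


def risk_level(coefficient: float) -> str:
    for lower_bound, name in RISK_BANDS:
        if coefficient > lower_bound:
            return name
    return "small"


def decide(coefficient: float) -> Dict[str, str]:
    """Return either a control instruction or the content of a warning page."""
    level = risk_level(coefficient)
    if coefficient > CONTROL_THRESHOLD:
        # The text names pausing and returning as possible processing states.
        return {"action": "control", "state": "pause", "level": level}
    return {"action": "warn", "level": level}
```

Keeping the bands in a single table makes it straightforward to attach a processing scheme to each level later.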

Example two

With reference to fig. 5, a schematic diagram of program modules of a risk identification system based on mass data according to a second embodiment of the present invention is shown. In the embodiment, the risk identification system 20 based on mass data may include or be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to implement the present invention and implement the risk identification method based on mass data. The program modules referred to in the embodiments of the present invention refer to a series of computer program instruction segments capable of performing specific functions, and are more suitable than the programs themselves for describing the execution process of the risk identification system 20 based on mass data in the storage medium. The following description will specifically describe the functions of the program modules of the present embodiment:

the first obtaining module 200 is configured to obtain target user data of a target user.

Specifically, the target user data includes target personal data of the target user, for example a target ID card number, a target license plate number, a target telephone number, a target name and a target address. The target personal data can be entered by the target user through the graphical user interface of the first obtaining module 200, or obtained from the target license plate number of the vehicle when the target user has an accident.

The searching module 202 is configured to search the target historical accident data of the target user according to the target user data.

An analysis module 204, configured to analyze, through a pre-established relationship network, a target community to which the target user data and the target historical accident data belong.

Specifically, the relationship network pre-established by the analysis module 204 includes a plurality of communities corresponding to a plurality of target sample users, and is analyzed by the fuzzy clustering algorithm to determine target communities to which target user data and target historical accident data belong.

Illustratively, the analysis module 204 is further configured to:

Performing cluster analysis on a plurality of feature data of a plurality of sample users corresponding to each feature item, and analyzing to obtain a plurality of cluster centers corresponding to each feature item; each characteristic item of each sample user is correspondingly associated with a clustering center;

Specifically, a Word2Vec model is trained first. The training corpus comes from the user data set and must be preprocessed by word segmentation, part-of-speech tagging and word combination. Training uses the skip-gram model with a training window size of 8, a sampling threshold of 1e-4 and a minimum frequency of 5; if a word or word combination occurs in the text fewer times than this minimum frequency, it is discarded. Very common words, for example "are", occur too often to be distinctive. The result is a Word2Vec model for the target field.
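The training settings above can be illustrated without committing to a particular Word2Vec library. The sketch below shows only two of the described ingredients: discarding tokens below the minimum frequency, and generating skip-gram (center, context) pairs within a window. The function names and the token format are hypothetical, and the sampling threshold and the actual embedding training are left out.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def filter_rare(sentences: Iterable[List[str]],
                min_count: int = 5) -> List[List[str]]:
    """Drop tokens that occur fewer than min_count times in the whole corpus."""
    corpus = [list(s) for s in sentences]
    counts = Counter(tok for s in corpus for tok in s)
    return [[tok for tok in s if counts[tok] >= min_count] for s in corpus]


def skipgram_pairs(sentence: List[str], window: int = 8) -> List[Tuple[str, str]]:
    """Pair every token with each neighbor at most `window` positions away."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs
```

With a window of 8, each token is paired with up to 16 neighbors, which suits short records such as addresses where most tokens in a record are related.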

Illustratively, the analysis module 204 is further configured to:

A second obtaining module 206, configured to obtain a target weight coefficient of each target user data and the target historical accident data according to the target community.

Illustratively, the second obtaining module 206 is further configured to:

The calculating module 208 is configured to calculate a risk coefficient of the target user based on the target weight coefficient.

A generating module 210, configured to generate a risk warning page or a risk control instruction according to the risk coefficient, where the risk warning page is sent to and presented in a target client, and the risk control instruction is used to control a processing state of a target account or a target order of the target user.
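The division into program modules can be read as a simple pipeline in which each module's output feeds the next. The sketch below is only one possible arrangement; the class name, the callable signatures and the stand-in modules in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class RiskIdentificationSystem:
    acquire: Callable[[str], Any]                    # first obtaining module 200
    search: Callable[[Any], Any]                     # searching module 202
    analyze: Callable[[Any, Any], Any]               # analysis module 204
    weigh: Callable[[Any, Any, Any], List[float]]    # second obtaining module 206
    calculate: Callable[[List[float]], float]        # calculating module 208
    generate: Callable[[float], Any]                 # generating module 210

    def run(self, user_key: str) -> Any:
        user_data = self.acquire(user_key)
        history = self.search(user_data)
        community = self.analyze(user_data, history)
        weights = self.weigh(user_data, history, community)
        return self.generate(self.calculate(weights))


# Usage with trivial stand-in modules:
system = RiskIdentificationSystem(
    acquire=lambda key: {"license_plate": key},
    search=lambda user: ["accident-1"],
    analyze=lambda user, history: "community-7",
    weigh=lambda user, history, community: [0.25, 0.5],
    calculate=sum,
    generate=lambda risk: "warn" if risk > 0.6 else "ok",
)
result = system.run("PLATE-001")
```

Passing the modules in as callables keeps each one replaceable, which matches the statement that the system may be divided into one or more program modules.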

Example three

Fig. 6 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), and the like. As shown in fig. 6, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a risk identification system 20 based on mass data, which are communicatively connected to each other through a system bus. Wherein:

In this embodiment, the memory 21 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device 2. Of course, the memory 21 may also comprise both an internal storage unit and an external storage device of the computer device 2. In this embodiment, the memory 21 is generally used for storing an operating system and various application software installed on the computer device 2, such as the program codes of the risk identification system 20 based on mass data in the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to run the program codes stored in the memory 21 or process data, for example, run the risk identification system 20 based on mass data, so as to implement the risk identification method based on mass data according to the first embodiment.

The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is generally used for establishing communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal via a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like. It is noted that fig. 6 only shows the computer device 2 with components 20-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

In this embodiment, the risk identification system 20 based on mass data stored in the memory 21 may be further divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.

For example, fig. 5 shows a schematic diagram of the program modules of the second embodiment implementing the risk identification system 20 based on mass data, in which the risk identification system 20 based on mass data may be divided into a first obtaining module 200, a searching module 202, an analysis module 204, a second obtaining module 206, a calculating module 208, and a generating module 210. The program modules referred to herein are a series of computer program instruction segments capable of performing specific functions, and are more suitable than programs for describing the execution process of the risk identification system 20 based on mass data in the computer device 2. The specific functions of the program modules 200-210 have been described in detail in the second embodiment, and are not described herein again.

Example four

The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an application store, etc., on which a computer program is stored, the computer program implementing corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used for storing the risk identification system 20 based on mass data, which, when executed by the processor, implements the risk identification method based on mass data of the first embodiment.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and can certainly also be implemented by hardware, but in many cases the former is the preferred implementation.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent processes derived from the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims

1. A risk identification method based on mass data is characterized by comprising the following steps:

acquiring target user data of a target user;

2. The risk identification method according to claim 1, wherein the step of establishing the relationship network specifically comprises:

and analyzing whether each sample user is associated with the same plurality of clustering centers as other sample users, and, if the analysis finds more than two target sample users associated with the same plurality of clustering centers, mapping the more than two target sample users to one target community.

3. The risk identification method according to claim 1, wherein the step of analyzing the target community to which the target user data and the target historical accident data belong through a pre-established relationship network comprises:

4. The risk identification method according to claim 1, wherein the step of obtaining the target weight coefficient of each of the target user data and the target historical accident data according to the target community comprises:

5. A risk identification system based on mass data, comprising:

6. The risk identification system of claim 5, wherein the analysis module is further configured to:

7. The risk identification system of claim 5, wherein the analysis module is further configured to:

8. The risk identification system of claim 5, wherein the second acquisition module is further configured to:

9. A computer device, characterized by comprising a memory and a processor, the memory storing a risk identification system based on mass data that is executable on the processor, wherein the risk identification system based on mass data, when executed by the processor, implements the steps of the risk identification method based on mass data according to any one of claims 1-4.

10. A computer-readable storage medium having stored therein a computer program, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the risk identification method based on mass data according to any one of claims 1-4.