CN112288025B - Abnormal case identification method, device, equipment and storage medium based on tree structure - Google Patents

Abnormal case identification method, device, equipment and storage medium based on tree structure

Info

Publication number
CN112288025B
CN112288025B
Authority
CN
China
Prior art keywords
original training
abnormal
model
sample
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011211514.3A
Other languages
Chinese (zh)
Other versions
CN112288025A (en)
Inventor
殷振滔
Current Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202011211514.3A
Publication of CN112288025A
Application granted
Publication of CN112288025B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Classification techniques
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/24323: Tree-organised classifiers
    • G06F 18/2433: Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for identifying abnormal cases based on a tree structure, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring original training samples from an initial case database; calculating an anomaly score for each original training sample based on the IForest algorithm; comparing the anomaly scores of the original training samples with a preset threshold, classifying the original training samples according to the comparison results, and forming a target training set; performing model training on an initial recognition model through the target training set and outputting an abnormal recognition model; and importing the case data of a case to be identified into the abnormal recognition model and outputting an identification result. The application also relates to blockchain technology, in which the anomaly scores of the original training samples may be stored. Because the anomaly scores of the original training samples are calculated by the IForest algorithm and the samples are classified according to their scores, the interference of human factors is effectively eliminated, which improves the accuracy of the case anomaly recognition model.

Description

Abnormal case identification method, device, equipment and storage medium based on tree structure
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a method, a device, equipment and a storage medium for identifying abnormal cases based on a tree structure.
Background
Traditional case anomaly identification typically relies on manual investigation or statistical models, both of which take human judgment as their reference; such judgment is difficult to quantify and is affected by subjective factors. During manual labelling, abnormal cases are marked as positive samples and all other cases are treated as non-abnormal, i.e. negative, samples, and the proportion of abnormal cases is generally far smaller than that of non-abnormal cases, so a binary case-anomaly classifier trained on manually labelled data is not accurate enough. Moreover, because positive samples (abnormal cases) form a very small proportion of the training data and anomalies take many forms, the non-abnormal samples are in fact impure: manual judgment may miss anomalies, so the negative samples are contaminated with some abnormal cases. This means the anomaly distribution of the historical data differs from the true distribution, and the unrecognised abnormal samples act as dirty data that degrades the classifier.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, equipment and a storage medium for identifying abnormal cases based on a tree structure, so as to solve the technical problem of insufficient accuracy of the existing case identification model obtained through artificial annotation data training.
In order to solve the above technical problems, the embodiment of the present application provides a method for identifying abnormal cases based on a tree structure, which adopts the following technical scheme:
An abnormal case identification method based on a tree structure, comprising the following steps:
Acquiring an original training set in a preset time period from an initial case database, wherein the original training set comprises a plurality of original training samples;
calculating an abnormal score of each original training sample in the original training set based on a random isolated forest algorithm;
comparing the abnormal score of each original training sample with a preset threshold value, and classifying the original training samples according to the comparison result to obtain positive samples and negative samples;
randomly combining the positive samples and negative samples obtained by classification to form a target training set;
constructing an initial recognition model, carrying out model training on the initial recognition model through a target training set, and outputting an abnormal recognition model;
Acquiring case data of a case to be identified, importing the case data of the case to be identified into an abnormal identification model, and outputting an identification result.
Further, the step of calculating the abnormal score of each original training sample in the original training set based on the random isolated forest algorithm specifically comprises the following steps:
constructing a binary tree through a plurality of original training samples in an original training set;
calculating the path length of each original training sample in the binary tree, and calculating the anomaly score of each original training sample based on the path length.
Further, the step of constructing a binary tree from a plurality of original training samples in the original training set specifically includes:
Extracting a plurality of original training samples from an original training set, and importing the extracted plurality of original training samples into a preset initial binary tree model;
Acquiring sample characteristics of each original training sample, and combining the acquired sample characteristics to form a characteristic set;
dividing the original training set by the feature set until the original training samples of the original training set can no longer be subdivided, and outputting a binary tree.
Further, the step of dividing the original training set by the feature set until the original training samples can no longer be subdivided and outputting a binary tree specifically includes:
randomly extracting sample features in the feature set in sequence, and determining the maximum value and the minimum value of the extracted sample features;
Randomly selecting a numerical value between the maximum value and the minimum value as a cutting point, and dividing an original training sample of an original training set;
and traversing sample features of the feature set until the depth of the binary tree meets the preset depth, and acquiring the binary tree with the depth meeting the requirement.
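The random cut-point selection described in the steps above can be sketched as follows. This is an illustrative sketch, not from the patent; `choose_split` is a hypothetical helper name.

```python
import random

def choose_split(samples, feature_idx, rng=random):
    """Pick a random cut point between the minimum and maximum of one
    sample feature, then partition samples into left (< cut) and
    right (>= cut), as when growing one level of an isolation tree."""
    values = [s[feature_idx] for s in samples]
    lo, hi = min(values), max(values)
    cut = rng.uniform(lo, hi)  # random value between min and max
    left = [s for s in samples if s[feature_idx] < cut]
    right = [s for s in samples if s[feature_idx] >= cut]
    return cut, left, right

random.seed(0)
data = [(1.0, 5.0), (2.0, 3.0), (9.0, 1.0)]
cut, left, right = choose_split(data, feature_idx=0)
print(len(left) + len(right))  # every sample lands on exactly one side
```

Repeating this on each resulting subset with freshly drawn features, until the preset depth is reached, yields the binary tree described above.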
Further, the step of calculating the path length of each original training sample in the binary tree, and calculating the anomaly score of each original training sample based on the path length specifically includes:
Counting the number of edges of each original training sample in a binary tree, and calculating the initial path length of each original training sample in the binary tree according to the number of edges;
Calculating a path correction value, and correcting the initial path length of each original training sample through the path correction value to obtain the path length of each original training sample in a binary tree;
and calculating the abnormal score of each original training sample through the path length.
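The patent does not reproduce the formulas, but in the standard isolation-forest formulation the path correction value c(n) is the average path length of an unsuccessful binary-search-tree lookup over n samples, and the anomaly score is s(x, n) = 2 ** (-E(h(x)) / c(n)). A minimal sketch under that assumption:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Path correction value: average path length of an unsuccessful
    search in a binary search tree built on n samples."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)): short average paths give
    scores near 1 (anomalous), long paths give scores near 0."""
    return 2.0 ** (-avg_path_length / c(n))

# A sample isolated after ~2 splits in trees of 256 samples looks abnormal,
# while one needing ~12 splits looks normal.
print(round(anomaly_score(2.0, 256), 2))   # 0.87
print(round(anomaly_score(12.0, 256), 2))  # 0.44
```

When the average path length equals c(n), the score is exactly 0.5, which is why thresholds in (0.5, 1] such as the 0.8 used later pick out clearly anomalous samples.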
Further, constructing an initial recognition model, performing model training on the initial recognition model through a target training set, and outputting an abnormal recognition model, wherein the method specifically comprises the following steps:
Randomly segmenting the target training set into K equal training subsets, wherein K is a positive integer;
randomly extracting K-1 training subsets to form a model training set, and performing model training on the initial recognition model;
taking the remaining training subset as a cross-validation set, performing cross-validation on the trained initial recognition model, and outputting a first verification result;
and iteratively updating the initial recognition model according to the first verification result until the initial recognition model converges, and outputting the converged abnormal recognition model.
Further, the step of iteratively updating the initial recognition model according to the first verification result until it converges and outputting the converged abnormal recognition model specifically includes:
adjusting the model parameters of the initial recognition model, and performing model training on the parameter-adjusted initial recognition model through the model training set; and
cross-validating the trained initial recognition model through the cross-validation set, and outputting a second verification result;
comparing the first verification result with the second verification result; if they differ, continuing to adjust the model parameters of the initial recognition model until training yields identical first and second verification results, and outputting the converged abnormal recognition model.
In order to solve the technical problems, the embodiment of the application also provides an abnormal case identification device based on a tree structure, which adopts the following technical scheme:
An abnormal case recognition apparatus based on a tree structure, comprising:
The acquisition module is used for acquiring an original training set in a preset time period from the initial case database, wherein the original training set comprises a plurality of original training samples;
the calculation module is used for calculating the abnormal score of each original training sample in the original training set based on a random isolated forest algorithm;
The comparison module is used for comparing the abnormal score of each original training sample with a preset threshold value, classifying the original training samples according to the comparison result, and obtaining positive samples and negative samples;
The combination module is used for randomly combining the positive samples and negative samples obtained by classification to form a target training set;
the training module is used for constructing an initial recognition model, carrying out model training on the initial recognition model through a target training set and outputting an abnormal recognition model;
the recognition module is used for acquiring the case data of the case to be recognized, importing the case data of the case to be recognized into the abnormal recognition model and outputting a recognition result.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the tree structure based abnormal case identification method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
A computer readable storage medium having computer readable instructions stored thereon which when executed by a processor perform the steps of the tree structure based abnormal case identification method described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
When the model training set is constructed, the anomaly score of each original training sample is calculated by the IForest algorithm; the score of each sample is then compared in turn with a preset threshold, and the samples are classified according to the comparison results into positive and negative samples. An abnormal recognition model is trained on the training set obtained in this way, and the trained model is then used to identify whether a case to be identified is abnormal. Because the anomaly scores of the original training samples are calculated by the IForest algorithm and the samples are classified by score, the interference of human factors is effectively eliminated and the influence of subjective factors reduced; moreover, the preset threshold can be adjusted to actual conditions to improve the ratio of positive to negative samples in the training set, which solves the prior-art problem that too few positive samples make the trained case-anomaly recognition model insufficiently accurate.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 illustrates a flow chart of one embodiment of a tree structure based abnormal case identification method in accordance with the present application;
FIG. 3 shows a flow chart of one embodiment of step S202 in FIG. 2;
FIG. 4 is a schematic diagram of the construction of a binary tree in an embodiment of the application;
FIG. 5 is a schematic diagram illustrating one embodiment of a tree structure based abnormal case recognition apparatus in accordance with the present application;
fig. 6 shows a schematic structural diagram of an embodiment of a computer device according to the application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the method for identifying abnormal cases based on the tree structure provided by the embodiment of the present application is generally executed by a server, and accordingly, the device for identifying abnormal cases based on the tree structure is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flowchart of one embodiment of a method of tree-structure based anomaly case identification is shown in accordance with the present application. The abnormal case identification method based on the tree structure comprises the following steps:
S201, acquiring an original training set in a preset time period from an initial case database, wherein the original training set comprises a plurality of original training samples;
Specifically, the original training set is the set of all cases within a preset period in the initial case database, which stores the data of all cases; during model training, each case in the initial case database can be regarded as an original training sample, and these cases are unprocessed. In a vehicle-insurance claim scenario, the information associated with an abnormal case includes the case number, the persons involved, the vehicles involved, related certificates and other data; the persons involved mainly include the insured, repair-shop personnel, insurance-company personnel, the traffic police concerned and so on, all of which are original information recorded when the case occurred. It should be noted that the cases in the initial case database may also be financial claim cases, critical-illness claim cases and the like; the application is not limited in this respect.
In this embodiment, the electronic device (for example, the server shown in FIG. 1) on which the tree structure based abnormal case identification method runs may receive the user request through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, Wi-Fi, Bluetooth, WiMAX, ZigBee, UWB (ultra-wideband) and other wireless connections now known or developed in the future.
S202, calculating an abnormal score of each original training sample in the original training set based on a random isolated forest algorithm;
Anomaly-score analysis is the process of checking whether the data contains unreasonable values, and ignoring anomalies during model training is risky. In a specific embodiment of the application, the anomaly score of a case is calculated by the IForest (isolation forest) algorithm. IForest is a non-parametric, unsupervised anomaly-analysis method: it defines no mathematical model and requires no labelled training data. An IForest consists of t isolation trees (iTree), each of which is a binary tree. To find which points are easily isolated, IForest uses a very efficient strategy: a random hyperplane is used to cut the data space, producing two subspaces at a time; each subspace is then cut again with another random hyperplane, and so on, until every subspace contains only one data point. Intuitively, clusters of very high density require many cuts before the cutting stops, whereas points in very low-density regions reach a subspace of their own very early.
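This isolation strategy is implemented, for example, by scikit-learn's `IsolationForest`; the following minimal sketch is illustrative and not from the patent, and the data is synthetic. Note that scikit-learn's `score_samples` returns the negated anomaly score, so it is negated again to recover a scale where higher means more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# A dense cluster of "normal" cases plus three obvious outliers.
normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
outliers = np.array([[6.0, 6.0], [-7.0, 5.0], [8.0, -6.0]])
X = np.vstack([normal, outliers])

# An ensemble of isolation trees, each built on random cuts of the space.
forest = IsolationForest(n_estimators=100, random_state=42).fit(X)
# score_samples returns the negated anomaly score; negate it back so
# that a higher value means "isolated earlier", i.e. more anomalous.
scores = -forest.score_samples(X)
print(scores[200:].mean() > scores[:200].mean())  # outliers score higher
```

The low-density outliers are isolated after very few cuts and therefore receive the highest scores, matching the intuition described above.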
Specifically, a binary tree is constructed by using the original training samples in the original training set, data information of the formed binary tree is obtained, and an abnormal score of each original training sample in the original training set is calculated according to the data information of the binary tree.
S203, comparing the abnormal score of each original training sample with a preset threshold value, and classifying the original training samples according to the comparison result to obtain positive samples and negative samples;
Specifically, the IForest algorithm yields an anomaly score in the range [0, 1] for every original training sample in the original training set, and a preset threshold is set, for example 0.8. The anomaly score of each original training sample is compared with the preset threshold in turn to identify the sample's type: cases in the original training set whose score exceeds the threshold are added to the positive samples (marked as abnormal cases), and cases below the threshold are added to the negative samples (marked as normal cases). In a specific embodiment of the application, the preset threshold can be set according to the requirements of the actual scenario; for example, when the identification requirement is strict, the preset threshold is adjusted upward.
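The threshold comparison in this step can be sketched as follows; `label_by_threshold` is a hypothetical helper and the scores are made up, while the 0.8 threshold follows the embodiment above.

```python
def label_by_threshold(scores, threshold=0.8):
    """Split anomaly scores in [0, 1] into positive (abnormal) and
    negative (normal) sample indices by a preset threshold."""
    positives = [i for i, s in enumerate(scores) if s > threshold]
    negatives = [i for i, s in enumerate(scores) if s <= threshold]
    return positives, negatives

# Hypothetical anomaly scores for five cases.
scores = [0.95, 0.30, 0.81, 0.42, 0.78]
pos, neg = label_by_threshold(scores, threshold=0.8)
print(pos, neg)  # [0, 2] [1, 3, 4]
```

Raising the threshold shrinks the positive set, which is how the embodiment tightens identification when requirements are strict.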
S204, randomly combining the positive samples and negative samples obtained by classification to form a target training set;
Specifically, after classifying the original training samples, positive samples and negative samples are obtained, the positive samples and the negative samples are recombined to generate a target training set, and the initial recognition model is trained and verified through the target training set.
S205, constructing an initial recognition model, carrying out model training on the initial recognition model through a target training set, and outputting an abnormal recognition model;
Specifically, the initial recognition model can be built under a K-fold cross-validation framework, which facilitates the subsequent cross-validation: the training-set data is randomly split into K parts, K-1 parts are used as the model training set and the remaining part as the cross-validation set, and the initial recognition model is validated by cross-validation, which reduces over-fitting and improves robustness. A mature abnormal recognition model is then formed from the K supervised classifiers.
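A minimal sketch of this K-fold scheme, using scikit-learn's `KFold` with K = 5 and a logistic-regression stand-in for the initial recognition model; the data and model choice are illustrative assumptions, not from the patent.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
# Synthetic stand-in for the target training set.
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
accuracies = []
for train_idx, val_idx in kf.split(X):
    # K-1 folds train the model; the held-out fold cross-validates it.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[val_idx], y[val_idx]))

print(len(accuracies))  # one validation score per fold
```

Every sample serves exactly once as validation data, which is what reduces the over-fitting the embodiment mentions.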
It should be noted that a single machine-learning model often performs poorly, so a specific embodiment of the application adopts model stacking when training the abnormal recognition model: several base classifiers are trained together to form the model, and a further sub-classifier organises and exploits them by taking the base models' answers as input, learning how to weight those answers so as to reduce generalisation error. Both the base classifiers and the sub-classifier are binary classifiers. Specifically, the stacking model uses K base classifiers and one sub-classifier: the training-set data is randomly divided into K parts; K-1 parts serve as the model training set and the remaining part as the cross-validation set, and the base classifiers are trained in rotation, finally yielding K trained base classifiers. The verification results of the K base classifiers are output; K-1 of them are randomly combined as the sub-classifier's training data set and the remaining one serves as its cross-validation set, and the sub-classifier is trained in rotation, finally yielding a trained sub-classifier. The combination of the one sub-classifier and the K base classifiers forms a converged case-anomaly recognition model. In a specific embodiment of the application, the base classifiers use the LightGBM or CatBoost model and the sub-classifier uses the Logistic Regression model.
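The stacking arrangement can be sketched with scikit-learn's `StackingClassifier`. As an assumption, gradient-boosting base classifiers stand in for the LightGBM/CatBoost models named in the embodiment (which may not be installed), and `cv=5` supplies the K-fold out-of-fold predictions that train the logistic-regression sub-classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for labelled cases.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Three gradient-boosted base classifiers; a logistic-regression
# sub-classifier learns to weight their out-of-fold answers (cv=5).
base = [(f"gb{i}", GradientBoostingClassifier(random_state=i))
        for i in range(3)]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X, y)
print(round(stack.score(X, y), 2))
```

The meta-learner sees only cross-validated base predictions during fitting, which is the generalisation-error safeguard the paragraph describes.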
S206, acquiring case data of the case to be identified, importing the case data of the case to be identified into an abnormal identification model, and outputting an identification result.
Specifically, after training the abnormal recognition model, acquiring case data of the case to be recognized, and importing the case data of the case to be recognized into the abnormal recognition model can directly acquire an abnormal recognition result of the case to be recognized.
The embodiment of the application discloses a tree structure based abnormal case identification method: when the model training set is constructed, the anomaly score of each original training sample is calculated by the IForest algorithm; each score is compared in turn with a preset threshold, and the samples are classified according to the comparison results into positive and negative samples; an abnormal recognition model is trained on the training set obtained in this way, and the trained model then identifies whether a case to be identified is abnormal. Because the anomaly scores of the original training samples are calculated by the IForest algorithm and the samples are classified by score, the interference of human factors is effectively eliminated and the influence of subjective factors reduced; the preset threshold can also be adjusted to actual conditions to improve the ratio of positive to negative samples in the training set, solving the prior-art problem that too few positive samples make the trained case-anomaly recognition model insufficiently accurate.
Further, referring to fig. 3, fig. 3 is a flowchart showing a specific embodiment of step S202 in fig. 2, and the step of calculating the anomaly score of each original training sample in the original training set based on the random isolated forest algorithm specifically includes:
s301, constructing a binary tree through a plurality of original training samples in an original training set;
S302, calculating the path length of each original training sample in the binary tree, and calculating the abnormal score of each original training sample based on the path length.
In the above embodiment, the original training samples in the original training set are imported into the root node of the tree model, then divided according to certain conditions and filled into the leaf nodes of the tree model to form a binary tree; the path length of each original training sample in the binary tree is counted, and the anomaly score of each original training sample can then be calculated from the obtained path lengths.
Further, the step of constructing a binary tree from a plurality of original training samples in the original training set specifically includes:
Extracting a plurality of original training samples from an original training set, and importing the extracted plurality of original training samples into a preset initial binary tree model;
Acquiring sample characteristics of each original training sample, and combining the acquired sample characteristics to form a characteristic set;
dividing the original training set by the feature set until the original training sample of the original training set is not subdivided, and outputting a binary tree.
In the above embodiment, the sample features of each original training sample are obtained and combined to form a feature set, and the original training set is divided using that feature set. When dividing, one sample feature is randomly drawn from the feature set to split the original training set into subsets; another sample feature is then randomly drawn to split those subsets, and so on, traversing the sample features of the feature set until the original training samples can no longer be subdivided. At that point the divided original training samples are filled into the leaf nodes of the tree model, yielding a binary tree. It should be noted that by adjusting the order of the sample features, a plurality of binary trees can be obtained; calculating the anomaly score from the combination of these binary trees improves the accuracy of sample division.
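The tree-building procedure described above can be sketched as follows. This is an illustrative toy implementation; the function name, dictionary-based node layout, depth limit, and sample values are assumptions, not taken from the patent:

```python
import random

def build_tree(samples, features, depth=0, max_depth=8):
    # stop when the node cannot be subdivided or the preset depth is reached
    if len(samples) <= 1 or depth >= max_depth:
        return {"leaf": True, "size": len(samples), "depth": depth}
    f = random.choice(features)                   # randomly drawn sample feature
    lo = min(s[f] for s in samples)
    hi = max(s[f] for s in samples)
    if lo == hi:                                  # this feature cannot split the node
        return {"leaf": True, "size": len(samples), "depth": depth}
    cut = random.uniform(lo, hi)                  # random cut point between min and max
    left = [s for s in samples if s[f] < cut]     # e.g. age < cut  -> left leaf
    right = [s for s in samples if s[f] >= cut]   # e.g. age >= cut -> right leaf
    return {"leaf": False, "feature": f, "cut": cut,
            "left": build_tree(left, features, depth + 1, max_depth),
            "right": build_tree(right, features, depth + 1, max_depth)}

samples = [{"age": a, "vehicle_age": v}
           for a, v in [(28, 2), (33, 5), (40, 1), (45, 3), (56, 7)]]
tree = build_tree(samples, ["age", "vehicle_age"])
```

Because the feature and cut point are drawn randomly, repeated calls yield different trees, which is what allows a forest of such trees to be combined for scoring.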
Further, the original training set is divided by the feature set until the original training sample of the original training set is not subdivided, and a binary tree is output, which specifically includes:
randomly extracting sample features in the feature set in sequence, and determining the maximum value and the minimum value of the extracted sample features;
Randomly selecting a numerical value between the maximum value and the minimum value as a cutting point, and dividing an original training sample of an original training set;
and traversing sample features of the feature set until the depth of the binary tree meets the preset depth, and acquiring the binary tree with the depth meeting the requirement.
In the above embodiment, when a sample feature is randomly obtained from the feature set to divide the original training set, the maximum and minimum values of that feature are first determined. For example, referring to fig. 4, which shows the construction of a binary tree in an embodiment of the present application, 10 original training samples are divided by the sample feature "age": the maximum age among the 10 samples is 56 and the minimum is 28, so a value (e.g. 40) is randomly selected between 28 and 56 as the cutting point. The original training samples with age less than 40 are placed in new leaf node 1 and those with age greater than or equal to 40 in new leaf node 2, where leaf node 1 lies to the left of the root node and leaf node 2 to the right. The original training samples in leaf nodes 1 and 2 are then divided by the sample feature "vehicle age", with the division results placed in leaf nodes 3, 4, 5 and 6 respectively. The sample features of the feature set are traversed in this manner until the depth of the binary tree reaches the preset depth, i.e. the original training samples cannot be subdivided further, at which point a binary tree whose depth meets the requirement is obtained.
Further, the step of calculating the path length of each original training sample in the binary tree, and calculating the anomaly score of each original training sample based on the path length specifically includes:
Counting the number of edges of each original training sample in a binary tree, and calculating the initial path length of each original training sample in the binary tree according to the number of edges;
Calculating a path correction value, and correcting the initial path length of each original training sample through the path correction value to obtain the path length of each original training sample in a binary tree;
and calculating the abnormal score of each original training sample through the path length.
In the above embodiment, to calculate the anomaly score of a certain original training sample x, its path length (or depth) in each binary tree is first calculated. Specifically, starting from the root node of a binary tree, x is routed downward according to the values of the different sample features until it reaches a leaf node that cannot be subdivided; the number of edges e that x passes through from the root node to the leaf node is counted, giving the initial path length h0(x), i.e. h0(x) = e. To obtain an accurate anomaly score for x, the initial path length h0(x) must be corrected by adding a path correction value: assuming that n samples in the original training set fall on the same leaf node as x, the path length h(x) of x on the binary tree can be calculated by the following formula:
h(x)=h0(x)+C(n)
That is, h(x) = e + C(n), where e represents the number of edges that the original training sample x passes through from the root node to the leaf node, and C(n) is the path correction value, representing the average path length of the n samples that fall on the same leaf node as x. In general, C(n) is calculated as follows:
C(n) = 2H(n-1) - 2(n-1)/n
where H(n-1) can be estimated by ln(n-1) + M, the constant M being the Euler constant with value 0.5772156649. The final anomaly score Score(x) of the original training sample x is obtained by integrating the results of a plurality of binary trees:
Score(x) = 2^(-E(h(x)) / C(ψ))
where E(h(x)) represents the average of the path lengths of sample x over the plurality of binary trees, ψ represents the number of training samples of a single binary tree, and C(ψ) represents the average path length of a binary tree constructed from ψ samples, used here mainly for normalization.
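These standard isolation-forest definitions — C(n) = 2H(n-1) - 2(n-1)/n with H(i) ≈ ln(i) + 0.5772156649, h(x) = e + C(n), and Score(x) = 2^(-E(h(x))/C(ψ)) — can be transcribed directly. The following is an illustrative sketch, not code from the patent:

```python
import math

EULER = 0.5772156649

def C(n):
    # average path length correction for n samples sharing a leaf
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER) - 2.0 * (n - 1) / n

def path_length(e, n):
    # e: edges from root to the leaf holding x; n: samples in that leaf
    return e + C(n)

def anomaly_score(path_lengths, psi):
    # psi: number of training samples used to build each binary tree
    avg = sum(path_lengths) / len(path_lengths)   # E(h(x)) over the trees
    return 2.0 ** (-avg / C(psi))
```

Short average paths (the sample is isolated quickly) push the score toward 1, while long paths push it toward 0, matching the normalization role of C(ψ) described above.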
Further, constructing an initial recognition model, performing model training on the initial recognition model through a target training set, and outputting an abnormal recognition model, wherein the method specifically comprises the following steps:
Randomly segmenting a target training set into training subsets of K equal parts, wherein K is a positive integer;
Randomly extracting K-1 training subsets to form a model training set, and carrying out model training on the initial recognition model;
taking the rest training subset as a cross verification set, performing cross authentication on the trained initial recognition model, and outputting a first verification result;
and carrying out iterative updating on the initial recognition model according to the first verification result until the initial recognition model is converged, and outputting an abnormal recognition model after the convergence of the model.
In this embodiment, the training set data is randomly split into K parts; K-1 parts serve as the model training set and the remaining part as the cross-validation set, and the base classifier is trained cyclically in this way to obtain K trained base classifiers, which are then integrated by one secondary classifier to form the converged case anomaly recognition model. In a specific embodiment of the application, the base classifier is a LightGBM or CatBoost model and the secondary classifier is a Logistic Regression model. In a specific embodiment, the training set data is randomly split into 10 parts [1,2,3,4,5,6,7,8,9,10], and K-1 training subsets are randomly extracted and combined into the model training set: for example, [1,2,3,4,5,6,7,8,9] trains classifier K1 and [10] verifies the trained K1; [1,2,3,4,5,6,7,8,10] trains classifier K2 and [9] verifies the trained K2; and so on, until 10 classifiers K1 through K10 have been trained, which are then integrated by one secondary classifier to obtain the anomaly recognition model. Meanwhile, cross-validating the 10 classifiers yields 10 verification results, whose average is taken as the first verification result.
In this specific embodiment, the verification results of the 10 base classifiers K1 through K10 are output on their respective verification sets; 9 of the verification results are then randomly combined as the secondary-classifier training data set while the remaining one serves as a cross-validation set for cyclically training the secondary classifier. The trained secondary classifier is finally combined with the 10 base classifiers to obtain the case anomaly recognition model.
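The K-fold stacking scheme above can be sketched as follows. This is a hedged illustration: GradientBoostingClassifier stands in for the LightGBM/CatBoost base model named in the text, and the synthetic dataset, fold count, and variable names are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, random_state=0)
K = 10
oof = np.zeros(len(y))          # out-of-fold predictions from the K base classifiers
base_models = []

for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])                 # train on K-1 folds
    oof[val_idx] = clf.predict_proba(X[val_idx])[:, 1]  # verify on the held-out fold
    base_models.append(clf)

# the secondary (Logistic Regression) classifier is fit on the base outputs
meta = LogisticRegression().fit(oof.reshape(-1, 1), y)
first_verification = oof.round().astype(int)            # averaged CV result proxy
```

Each base classifier only ever predicts on data it did not see during training, so the secondary classifier is fit on unbiased verification results rather than in-sample predictions.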
Further, the step of iteratively updating the initial recognition model according to the first verification result until the initial recognition model converges, and outputting the abnormal recognition model after the model convergence specifically includes:
model parameters of the initial recognition model are adjusted, and the model training set is used for training the model of the initial recognition model after the parameters are modified; and
Cross-authenticating the trained initial recognition model through the cross-authentication set, and outputting a second authentication result;
Comparing the first verification result with the second verification result, if the first verification result is different from the second verification result, continuing to adjust the model parameters of the initial recognition model until training to obtain the first verification result and the second verification result which are the same, and outputting an abnormal recognition model after model convergence.
In the above embodiment, after the first verification result is obtained, the initial recognition model is iteratively updated according to it. Specifically, the model parameters of the initial recognition model are adjusted by adding a step-length parameter step to them; the model training set is then used to train the initial recognition model with the modified parameters, the cross-validation set performs cross-validation on the trained model, and a second verification result is output. The first and second verification results are compared; if they differ, the model parameters of the initial recognition model continue to be adjusted until training produces first and second verification results that are the same, at which point the converged anomaly recognition model is output.
It should be noted that the original training samples screened by the IForest algorithm need to be verified twice using the step parameter. By increasing the step parameter step (for example, by 0.01) and iterating continuously until the mean of the predicted results no longer changes, the quality of the abnormal samples is guaranteed and an optimal recognition result can be reached.
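The iterate-until-the-mean-stops-changing loop can be illustrated abstractly as follows; the function names and the toy stand-in for "train, cross-validate, and return the mean predicted result" are assumptions, not the patent's code:

```python
def iterate_until_stable(predict_mean, param, step=0.01, tol=1e-9, max_iter=1000):
    """Bump `param` by `step` each round until the mean prediction stabilizes."""
    prev = predict_mean(param)
    for _ in range(max_iter):
        param += step                      # adjust the model parameter by the step
        cur = predict_mean(param)
        if abs(cur - prev) < tol:          # verification result no longer changes
            return param, cur
        prev = cur
    return param, prev

# toy stand-in: the mean predicted result stops improving once the parameter
# exceeds 0.5, mimicking a model whose CV score has saturated
toy = lambda p: min(p, 0.5)
final_param, final_mean = iterate_until_stable(toy, 0.3, step=0.01)
```

The stopping condition is exactly the "first and second verification results are the same" test from the embodiment, expressed as a tolerance on consecutive means.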
It should be emphasized that, to further ensure the privacy and security of the anomaly scores of the original training samples, the anomaly scores of the original training samples may also be stored in nodes of a blockchain.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods of the embodiments described above may be accomplished by way of computer-readable instructions stored on a computer-readable storage medium, which when executed may comprise the processes of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a tree-structure-based abnormal case recognition apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the abnormal case identification apparatus based on the tree structure according to the present embodiment includes:
an obtaining module 501, configured to obtain an original training set in a predetermined time period from an initial case database, where the original training set includes a plurality of original training samples;
the calculating module 502 is configured to calculate an abnormal score of each original training sample in the original training set based on a random isolated forest algorithm;
a comparison module 503, configured to compare the abnormal score of each original training sample with a preset threshold, and classify the original training samples according to the comparison result to obtain positive samples and negative samples;
A combining module 504, configured to randomly combine the positive sample and the negative sample obtained by the recognition to form a target training set;
The training module 505 is configured to construct an initial recognition model, perform model training on the initial recognition model through a target training set, and output an abnormal recognition model;
the identifying module 506 is configured to obtain case data of a case to be identified, import the case data of the case to be identified into the abnormal identifying model, and output an identifying result.
Further, the computing module 502 specifically includes:
the binary tree construction submodule is used for constructing a binary tree through a plurality of original training samples in the original training set;
and the path calculation sub-module is used for calculating the path length of each original training sample in the binary tree and calculating the abnormal score of each original training sample based on the path length.
Further, the binary tree construction submodule specifically includes:
The sample importing unit is used for extracting a plurality of original training samples from the original training set and importing the extracted plurality of original training samples into a preset initial binary tree model;
The feature combination unit is used for acquiring sample features of each original training sample and combining the acquired sample features to form a feature set;
The sample dividing unit is used for dividing the original training set through the feature set until the original training sample of the original training set is not subdivided, and outputting a binary tree.
Further, the sample dividing unit specifically includes:
The feature extraction subunit is used for randomly extracting sample features in the feature set in sequence and determining the maximum value and the minimum value of the extracted sample features;
The sample dividing subunit is used for randomly selecting a numerical value between the maximum value and the minimum value as a cutting point and dividing an original training sample of the original training set;
and the binary tree output subunit is used for traversing the sample features of the feature set until the depth of the binary tree meets the preset depth, and acquiring the binary tree with the depth meeting the requirement.
Further, the path computation submodule specifically includes:
the statistics unit is used for counting the number of edges of each original training sample in the binary tree, and calculating the initial path length of each original training sample in the binary tree according to the number of edges;
the correction unit is used for calculating a path correction value, and correcting the initial path length of each original training sample through the path correction value to obtain the path length of each original training sample in the binary tree;
And the calculating unit is used for calculating the abnormal score of each original training sample through the path length.
Further, the training module 505 specifically includes:
The segmentation sub-module is used for randomly segmenting the target training set into K equal training subsets, where K is a positive integer;
the training sub-module is used for randomly extracting K-1 training subsets to form a model training set, and carrying out model training on the initial recognition model;
the verification sub-module is used for taking the rest training subset as a cross verification set, carrying out cross authentication on the initial recognition model after training, and outputting a first verification result;
and the iteration sub-module is used for carrying out iteration update on the initial recognition model according to the first verification result until the initial recognition model is converged, and outputting an abnormal recognition model after the convergence of the model.
Further, the iteration sub-module specifically includes:
the parameter adjusting unit is used for adjusting the model parameters of the initial recognition model and training the model of the initial recognition model after the parameters are modified through the model training set; and
The cross verification unit is used for carrying out cross authentication on the trained initial recognition model through the cross verification set and outputting a second verification result;
And the comparison unit is used for comparing the first verification result with the second verification result, if the first verification result is different from the second verification result, continuing to adjust the model parameters of the initial recognition model until training to obtain the first verification result which is the same as the second verification result, and outputting an abnormal recognition model after the model is converged.
The embodiment of the application discloses an abnormal case identification device based on a tree structure. When the model training set is constructed, an anomaly score is calculated for each original training sample through the IForest algorithm; the anomaly score of each original training sample is then compared in turn with a preset threshold value, and the original training samples are classified according to the comparison results into positive samples and negative samples. An anomaly recognition model is trained on the model training set obtained in this way, and the trained anomaly recognition model identifies whether a case to be identified is abnormal. Because the anomaly score is calculated by the IForest algorithm and the samples are classified by that score, interference from human factors is effectively eliminated and the influence of subjective judgment is reduced; furthermore, the preset threshold can be changed according to actual conditions to improve the proportion of positive to negative samples in the training set, solving the prior-art problem that too few positive samples leave the trained case anomaly recognition model insufficiently accurate.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 6, fig. 6 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to each other via a system bus. It is noted that only a computer device 6 having components 61-63 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit and an external storage device of the computer device 6. In this embodiment, the memory 61 is generally used to store the operating system and the various application software installed on the computer device 6, such as the computer-readable instructions of the tree-structure-based abnormal case identification method. Further, the memory 61 may be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute computer readable instructions stored in the memory 61 or process data, for example, execute computer readable instructions of the tree structure-based abnormal case identification method.
The network interface 63 may comprise a wireless network interface or a wired network interface, which network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The embodiment of the application discloses computer equipment. When the model training set is constructed, an anomaly score is calculated for each original training sample through the IForest algorithm; the anomaly score of each original training sample is then compared in turn with a preset threshold value, and the original training samples are classified according to the comparison results into positive samples and negative samples. An anomaly recognition model is trained on the model training set obtained in this way, and the trained anomaly recognition model identifies whether a case to be identified is abnormal. Because the anomaly score is calculated by the IForest algorithm and the samples are classified by that score, interference from human factors is effectively eliminated and the influence of subjective judgment is reduced; in addition, the preset threshold can be changed according to actual conditions to improve the proportion of positive to negative samples in the training set, solving the prior-art problem that too few positive samples leave the trained case anomaly recognition model insufficiently accurate.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the tree-structure-based abnormal case identification method as described above.
The embodiment of the application discloses a computer-readable storage medium. When the model training set is constructed, an anomaly score is calculated for each original training sample through the IForest algorithm; the anomaly score of each original training sample is then compared in turn with a preset threshold value, and the original training samples are classified according to the comparison results into positive samples and negative samples. An anomaly recognition model is trained on the model training set obtained in this way, and the trained anomaly recognition model identifies whether a case to be identified is abnormal. Because the anomaly score is calculated by the IForest algorithm and the samples are classified by that score, interference from human factors is effectively eliminated and the influence of subjective judgment is reduced; in addition, the preset threshold can be changed according to actual conditions to improve the proportion of positive to negative samples in the training set, solving the prior-art problem that too few positive samples leave the trained case anomaly recognition model insufficiently accurate.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some embodiments of the present application, not all of them, and the preferred embodiments shown in the drawings do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that the present disclosure is thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or substitute equivalents for some of their features. All equivalent structures made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the application.

Claims (6)

1. The abnormal case identification method based on the tree structure is characterized by comprising the following steps of:
Acquiring an original training set in a preset time period from an initial case database, wherein the original training set comprises a plurality of original training samples;
calculating an abnormal score of each original training sample in the original training set based on a random isolated forest algorithm;
Comparing the abnormal score of each original training sample with a preset threshold value, and classifying the original training samples according to the comparison result to obtain positive samples and negative samples;
randomly combining the positive sample and the negative sample obtained by recognition to form a target training set;
Constructing an initial recognition model, carrying out model training on the initial recognition model through the target training set, and outputting an abnormal recognition model;
acquiring case data of a case to be identified, importing the case data of the case to be identified into the abnormal identification model, and outputting an identification result;
the step of calculating the abnormal score of each original training sample in the original training set based on the random isolation forest algorithm specifically comprises the following steps:
constructing a binary tree through a plurality of original training samples in the original training set;
calculating the path length of each original training sample in the binary tree, and calculating the abnormal score of each original training sample based on the path length;
the step of calculating the path length of each original training sample in the binary tree and calculating the abnormal score of each original training sample based on the path length specifically comprises the following steps:
counting the number of edges of each original training sample in the binary tree, and calculating the initial path length of each original training sample in the binary tree according to the number of edges;
calculating a path correction value, and correcting the initial path length of each original training sample through the path correction value to obtain the path length of each original training sample in the binary tree;
calculating the abnormal score of each original training sample through the path length;
the abnormal score of each original training sample is calculated by the following formula:
Score(x) = 2^(-E(h(x))/C(ψ))
where x is the original training sample, Score(x) is the abnormal score of the original training sample x, h(x) is the path length of the original training sample x on a binary tree, E(h(x)) denotes the average of the path lengths of x over the multiple binary trees, ψ denotes the number of training samples of a single binary tree, and C(ψ) denotes the average path length of a binary tree constructed from ψ pieces of data; h(x) can be calculated by the following formula:
h(x) = h0(x) + C(n)
wherein h0(x) is the initial path length of the original training sample x, and C(n) is the path correction value;
the formula for C(n) is as follows:
C(n) = 2H(n-1) - 2(n-1)/n
wherein H(n-1) is calculated as ln(n-1) + M, M is the Euler constant, and n is the number of original training samples located at the same leaf node as the training sample x;
the step of constructing a binary tree through a plurality of original training samples in the original training set specifically comprises:
extracting a plurality of original training samples from the original training set, and importing the extracted original training samples into a preset initial binary tree model;
obtaining sample features of each original training sample, and combining the obtained sample features to form a feature set;
dividing the original training set through the feature set until the original training samples of the original training set can no longer be subdivided, and outputting the binary tree;
the step of dividing the original training set through the feature set until the original training samples of the original training set can no longer be subdivided and outputting the binary tree specifically comprises:
sequentially and randomly extracting sample features from the feature set, and determining the maximum value and the minimum value of each extracted sample feature;
randomly selecting a value between the maximum value and the minimum value as a cut point, and dividing the original training samples of the original training set;
and traversing the sample features of the feature set until the depth of the binary tree meets a preset depth, thereby obtaining a binary tree whose depth meets the requirement.
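Outside the claim language, the scoring procedure of claim 1 (random cut points, edge counting, the C(n) correction, and the Score(x) formula) can be sketched in Python. This is a minimal illustrative sketch under assumed parameters, not the patented implementation; the function names, subsample size, and toy data are all assumptions.

```python
import math
import random

def c(n):
    # Path correction value C(n) = 2*H(n-1) - 2*(n-1)/n,
    # with H(i) = ln(i) + Euler's constant.
    if n <= 1:
        return 0.0
    euler = 0.5772156649
    return 2.0 * (math.log(n - 1) + euler) - 2.0 * (n - 1) / n

class Node:
    def __init__(self, size, feature=None, split=None, left=None, right=None):
        self.size, self.feature, self.split = size, feature, split
        self.left, self.right = left, right

def build_tree(samples, depth, max_depth):
    # Stop splitting once the preset depth is reached or the node holds at
    # most one sample (it can no longer be subdivided).
    if depth >= max_depth or len(samples) <= 1:
        return Node(len(samples))
    feature = random.randrange(len(samples[0]))   # random sample feature
    values = [s[feature] for s in samples]
    lo, hi = min(values), max(values)
    if lo == hi:
        return Node(len(samples))
    split = random.uniform(lo, hi)                # random cut point in [min, max]
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return Node(len(samples), feature, split,
                build_tree(left, depth + 1, max_depth),
                build_tree(right, depth + 1, max_depth))

def path_length(x, node, depth=0):
    # h(x) = h0(x) + C(n): the edge count plus the correction for the n
    # samples sharing the leaf that x falls into.
    if node.left is None:
        return depth + c(node.size)
    child = node.left if x[node.feature] < node.split else node.right
    return path_length(x, child, depth + 1)

def anomaly_score(x, trees, psi):
    # Score(x) = 2 ** (-E(h(x)) / C(psi)); scores near 1 suggest anomalies.
    e_h = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-e_h / c(psi))

# Toy usage: 256 inliers around the origin plus one obvious outlier.
random.seed(0)
data = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(256)]
data.append([8.0, 8.0])
psi = 64                                          # samples per tree (assumed)
trees = [build_tree(random.sample(data, psi), 0, math.ceil(math.log2(psi)))
         for _ in range(100)]
print(anomaly_score([8.0, 8.0], trees, psi))      # higher than the inlier below
print(anomaly_score([0.0, 0.0], trees, psi))
```

Averaging path lengths over many shallow trees is what makes the score stable: a single random tree can isolate an inlier early by chance, but not consistently across 100 trees.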
2. The abnormal case identification method based on the tree structure according to claim 1, wherein the steps of constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting the abnormal recognition model specifically comprise:
randomly segmenting the target training set into K equal training subsets, wherein K is a positive integer;
randomly extracting K-1 of the training subsets to form a model training set, and performing model training on the initial recognition model;
taking the remaining training subset as a cross-validation set, performing cross-validation on the trained initial recognition model, and outputting a first verification result;
and iteratively updating the initial recognition model according to the first verification result until the initial recognition model converges, and outputting the abnormal recognition model after the model converges.
3. The abnormal case identification method based on the tree structure according to claim 2, wherein the step of iteratively updating the initial recognition model according to the first verification result until the initial recognition model converges, and outputting the abnormal recognition model after model convergence specifically comprises:
adjusting model parameters of the initial recognition model, and training the initial recognition model with the modified parameters through the model training set; and
performing cross-validation on the trained initial recognition model through the cross-validation set, and outputting a second verification result;
comparing the first verification result with the second verification result; if the first verification result differs from the second verification result, continuing to adjust the model parameters of the initial recognition model until training yields identical first and second verification results, and outputting the abnormal recognition model after the model converges.
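The K-fold train-and-validate loop of claims 2 and 3 can be sketched as follows. The threshold "model" standing in for the initial recognition model is a deliberately simple assumption, as are all function names; the sketch only illustrates the split/train/validate/average structure.

```python
import random

def k_fold_split(samples, k):
    # Randomly segment the target training set into K (near-)equal subsets.
    shuffled = samples[:]
    random.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate(samples, k, train_fn, score_fn):
    # Train on K-1 subsets, validate on the held-out subset, and average
    # the K verification results.
    folds = k_fold_split(samples, k)
    results = []
    for i in range(k):
        held_out = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(train)
        results.append(score_fn(model, held_out))
    return sum(results) / k

# Toy stand-in for the recognition model: learn a decision threshold as the
# midpoint between the mean scores of the two classes.
def train_fn(train):
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def score_fn(threshold, fold):
    # Verification result: accuracy on the held-out cross-validation set.
    return sum((x >= threshold) == (y == 1) for x, y in fold) / len(fold)

random.seed(1)
data = ([(random.uniform(0.0, 0.45), 0) for _ in range(50)] +
        [(random.uniform(0.55, 1.0), 1) for _ in range(50)])
print(cross_validate(data, 5, train_fn, score_fn))  # high accuracy on this
                                                    # cleanly separable toy set
```

The convergence test of claim 3 (stop when consecutive verification results agree) would wrap this loop: adjust parameters, re-run `cross_validate`, and compare the new result with the previous one.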
4. A tree-structure-based abnormal case identification apparatus, wherein the tree-structure-based abnormal case identification apparatus implements the steps of the tree-structure-based abnormal case identification method according to any one of claims 1 to 3, the tree-structure-based abnormal case identification apparatus comprising:
an acquisition module, configured to acquire an original training set within a preset time period from an initial case database, wherein the original training set comprises a plurality of original training samples;
a calculation module, configured to calculate the abnormal score of each original training sample in the original training set based on the random isolation forest algorithm;
a comparison module, configured to compare the abnormal score of each original training sample with a preset threshold value and classify the original training samples according to the comparison result, wherein the types of the original training samples comprise positive samples and negative samples;
a combination module, configured to randomly combine the positive samples and negative samples obtained by classification to form a target training set;
a training module, configured to construct an initial recognition model, perform model training on the initial recognition model through the target training set, and output an abnormal recognition model;
and an identification module, configured to acquire case data of a case to be identified, import the case data of the case to be identified into the abnormal recognition model, and output an identification result.
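The comparison and combination modules above reduce to two small operations: splitting scored samples at the preset threshold, then shuffling the labelled result into a target training set. A sketch, with hypothetical case identifiers and scores that are not from the patent:

```python
import random

def label_by_threshold(scored_samples, threshold):
    # Compare each sample's abnormal score with the preset threshold: samples
    # scoring above it are classified as positive (abnormal), the rest negative.
    positives = [s for s, score in scored_samples if score > threshold]
    negatives = [s for s, score in scored_samples if score <= threshold]
    return positives, negatives

def build_target_training_set(positives, negatives):
    # Randomly combine the labelled positive and negative samples into the
    # target training set.
    labelled = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    random.shuffle(labelled)
    return labelled

# Hypothetical (case id, abnormal score) pairs.
scored = [("case1", 0.82), ("case2", 0.31), ("case3", 0.65), ("case4", 0.12)]
positives, negatives = label_by_threshold(scored, 0.6)
target_set = build_target_training_set(positives, negatives)
print(positives)        # ['case1', 'case3']
print(len(target_set))  # 4
```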
5. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the tree structure based abnormal case identification method of any one of claims 1 to 3.
6. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the tree structure based abnormal case identification method according to any of claims 1 to 3.
CN202011211514.3A 2020-11-03 2020-11-03 Abnormal case identification method, device, equipment and storage medium based on tree structure Active CN112288025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011211514.3A CN112288025B (en) 2020-11-03 2020-11-03 Abnormal case identification method, device, equipment and storage medium based on tree structure

Publications (2)

Publication Number Publication Date
CN112288025A CN112288025A (en) 2021-01-29
CN112288025B (en) 2024-04-30

Family

ID=74350537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011211514.3A Active CN112288025B (en) 2020-11-03 2020-11-03 Abnormal case identification method, device, equipment and storage medium based on tree structure

Country Status (1)

Country Link
CN (1) CN112288025B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488187B (en) * 2021-08-03 2024-02-20 南通市第二人民医院 Anesthesia accident case collecting and analyzing method and system
CN113836128A (en) * 2021-09-24 2021-12-24 北京拾味岛信息科技有限公司 Abnormal data identification method, system, equipment and storage medium
CN115018607B (en) * 2022-07-01 2023-01-24 吉林工程技术师范学院 Accounting data processing method and system based on artificial intelligence
CN117195139B (en) * 2023-11-08 2024-02-09 北京珺安惠尔健康科技有限公司 Chronic disease health data dynamic monitoring method based on machine learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948669A (en) * 2019-03-04 2019-06-28 腾讯科技(深圳)有限公司 A kind of abnormal deviation data examination method and device
CN110110757A (en) * 2019-04-12 2019-08-09 国电南瑞科技股份有限公司 A kind of power transmission and transformation suspicious data screening method and equipment based on Random Forest model
CN111382430A (en) * 2018-12-28 2020-07-07 卡巴斯基实验室股份制公司 System and method for classifying objects of a computer system
CN111581877A (en) * 2020-03-25 2020-08-25 中国平安人寿保险股份有限公司 Sample model training method, sample generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant