CN111444931B

CN111444931B - Method and device for detecting abnormal access data

Info

Publication number: CN111444931B
Application number: CN201910043579.2A
Authority: CN
Inventors: 文宏雕; 杨立军
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-01-17
Filing date: 2019-01-17
Publication date: 2024-06-18
Anticipated expiration: 2039-01-17
Also published as: CN111444931A

Abstract

The invention discloses a method and a device for detecting abnormal access data, and relates to the technical field of computers. One embodiment of the method comprises the following steps: training an abnormal access data detection model according to a pre-established initial training set; after the training is finished, determining the abnormal probability of the access data in a pre-established initial verification set by using the abnormal access data detection model; labeling as abnormal access data a plurality of access data in the initial verification set whose abnormal probability and accuracy meet preset discrimination conditions, and adding them to the initial training set; training the abnormal access data detection model according to the current training set; and inputting the access data to be detected into the trained abnormal access data detection model, and judging from the output result whether the access data to be detected is abnormal access data. According to the embodiment, the abnormal access data detection model can be continuously optimized by using the dynamically adjusted training set, so that the abnormality detection accuracy of the model is improved.

Description

Method and device for detecting abnormal access data

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for detecting abnormal access data.

Background

In the field of computer and Internet technology, web page access is carried mainly by access data in the form of link strings. In practice, some malicious requests craft abnormal access data in order to bypass supervision. Such techniques include Structured Query Language (SQL) injection, cross-site scripting (XSS) attacks, malicious proof-of-concept (POC) programs, and the like. Normal access data generally conforms to a standard format, whereas abnormal access data exhibits a certain randomness in its content.

In specific applications, abnormal access data takes many forms and changes frequently, while the amount of pre-labeled abnormal access data is relatively small. Existing detection algorithms for abnormal access data therefore generally rely on unsupervised learning, such as statistical methods, matrix decomposition, Isolation Forest, or deep networks. These algorithms cannot make use of labeled abnormal access data and cannot learn useful feature combinations with the help of the labeled data, so they have little room for improvement and low detection accuracy.

Disclosure of Invention

In view of this, the embodiment of the invention provides a method and a device for detecting abnormal access data, which can utilize a training set which is dynamically adjusted to continuously optimize an abnormal access data detection model, so as to improve the accuracy of abnormal detection of the model.

In order to achieve the above object, according to one aspect of the present invention, there is provided a method of detecting abnormal access data.

The method for detecting abnormal access data comprises the following steps: training an abnormal access data detection model according to a pre-established initial training set; after the training is finished, determining, by using the abnormal access data detection model, the abnormal probability of the access data in a pre-established initial verification set, wherein the initial verification set comprises pre-labeled abnormal access data and unlabeled access data; labeling as abnormal access data a plurality of access data in the initial verification set whose abnormal probability and accuracy meet preset discrimination conditions, adding the plurality of access data to the initial training set, and training the abnormal access data detection model according to the current training set, wherein the accuracy is the proportion of pre-labeled abnormal access data among the plurality of access data; and inputting the access data to be detected into the trained abnormal access data detection model, and judging from the output result whether the access data to be detected is abnormal access data.

Optionally, the method further comprises: removing the plurality of access data from the initial verification set after adding them to the initial training set; and performing at least one iterative training on the abnormal access data detection model, wherein any one iterative training comprises: training the abnormal access data detection model according to the current training set; after the training is finished, determining the abnormal probability of the access data in the current verification set by using the abnormal access data detection model; labeling as abnormal access data a plurality of access data in the current verification set whose abnormal probability and accuracy meet the discrimination conditions, and adding them to the current training set to form the training set for the next iterative training; and removing the plurality of access data from the current verification set to form the verification set for the next iterative training.

Optionally, in any one iterative training, the discrimination conditions include: selecting access data from the current verification set n times with replacement, in descending order of abnormal probability and according to a preset initial quantity and a preset increment quantity, wherein n is a positive integer; and determining the selection with the highest accuracy among the n selections as the optimal selection, and labeling the access data of the optimal selection as abnormal access data.

Optionally, in any one iterative training, the discrimination conditions further include: the accuracy of the access data of the optimal selection is not less than a preset threshold; and the method further comprises: in any one iterative training, stopping the iterative training if the abnormal probability and accuracy of the access data in the current verification set do not meet the discrimination conditions.

Optionally, the initial training set comprises pre-labeled abnormal access data and normal access data determined from unlabeled access data by using a pre-established unsupervised learning model, and the initial training set and the initial verification set are mutually exclusive; the abnormal access data detection model comprises a gradient boosting decision tree (GBDT) model and a logistic regression (LR) model; and in any one iterative training, the leaf nodes of the GBDT model are one-hot encoded and then used as input data of the LR model.

To achieve the above object, according to another aspect of the present invention, there is provided a detection apparatus of abnormal access data.

The device for detecting abnormal access data according to the embodiment of the invention may comprise: an initial training unit, configured to train an abnormal access data detection model according to a pre-established initial training set; a probability determination unit, configured to determine, after the training is finished and by using the abnormal access data detection model, the abnormal probability of the access data in a pre-established initial verification set, wherein the initial verification set comprises pre-labeled abnormal access data and unlabeled access data; an active learning unit, configured to label as abnormal access data a plurality of access data in the initial verification set whose abnormal probability and accuracy meet preset discrimination conditions, add the plurality of access data to the initial training set, and train the abnormal access data detection model according to the current training set, wherein the accuracy is the proportion of pre-labeled abnormal access data among the plurality of access data; and an abnormality detection unit, configured to input the access data to be detected into the trained abnormal access data detection model and judge from the output result whether the access data to be detected is abnormal access data.

Optionally, the apparatus may further include an iterative training unit, configured to: remove the plurality of access data from the initial verification set after adding them to the initial training set; and perform at least one iterative training on the abnormal access data detection model, wherein any one iterative training comprises: training the abnormal access data detection model according to the current training set; after the training is finished, determining the abnormal probability of the access data in the current verification set by using the abnormal access data detection model; labeling as abnormal access data a plurality of access data in the current verification set whose abnormal probability and accuracy meet the discrimination conditions, and adding them to the current training set to form the training set for the next iterative training; and removing the plurality of access data from the current verification set to form the verification set for the next iterative training.

Optionally, in any one iterative training, the discrimination conditions may include: selecting access data from the current verification set n times with replacement, in descending order of abnormal probability and according to a preset initial quantity and a preset increment quantity, wherein n is a positive integer; and determining the selection with the highest accuracy among the n selections as the optimal selection, and labeling the access data of the optimal selection as abnormal access data. In any one iterative training, the discrimination conditions may further include: the accuracy of the access data of the optimal selection is not less than a preset threshold. The iterative training unit may be further configured to: in any one iterative training, stop the iterative training if the abnormal probability and accuracy of the access data in the current verification set do not meet the discrimination conditions. The initial training set comprises pre-labeled abnormal access data and normal access data determined from unlabeled access data by using a pre-established unsupervised learning model, and the initial training set and the initial verification set are mutually exclusive. The abnormal access data detection model comprises a gradient boosting decision tree (GBDT) model and a logistic regression (LR) model; in any one iterative training, the leaf nodes of the GBDT model are one-hot encoded and then used as input data of the LR model.

To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.

An electronic apparatus of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for detecting abnormal access data provided by the present invention.

To achieve the above object, according to still another aspect of the present invention, there is provided a computer-readable storage medium.

A computer-readable storage medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the method of detecting abnormal access data provided by the present invention.

According to the technical scheme of the invention, one embodiment of the invention has the following advantages or beneficial effects:

Firstly, normal access data is determined from unlabeled access data by using an unsupervised learning model, the normal access data and the pre-labeled abnormal access data are combined into an initial training set to carry out iterative training on an abnormal access data detection model, so that a semi-supervised learning mechanism is realized, and the labeled access data and the unlabeled access data can be effectively utilized.

Secondly, the abnormal access data detection model adopts a gradient boosting decision tree (GBDT) model and a logistic regression (LR) model, wherein the leaf nodes of the GBDT model are one-hot encoded and then used as input data of the LR model, so that the abnormal access data detection model has good nonlinear mapping capability and strong prediction traceability.

Thirdly, the invention can use the dynamically adjusted training set and verification set to perform multiple rounds of iterative training on the abnormal access data detection model in an active learning manner. In each iterative training, the abnormal access data detection model trained on the current training set determines the abnormal probability of the access data in the current verification set; the access data with the largest abnormal probability is selected several times; and the selection with the highest accuracy is determined as the optimal selection. When the accuracy of the optimal selection is not less than a preset threshold, the access data of the optimal selection is removed from the current verification set and added to the current training set for the next iterative training. In the next iterative training, the previously trained abnormal access data detection model is inherited (that is, the model is warm-started), which avoids the loss of knowledge caused by the existing cold-start approach. When the accuracy of the optimal selection is less than the preset threshold, the iterative training stops and the training process of the abnormal access data detection model is complete. With this arrangement, while the quality of the training data is ensured, high-quality unlabeled access data can be extracted from the initial verification set to the greatest possible extent for iterative training of the model, thereby improving the abnormality detection accuracy of the model.

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of main steps of a method for detecting abnormal access data according to an embodiment of the present invention;

FIG. 2 is a diagram of normal access data and abnormal access data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a training process of an abnormal access data detection model in an embodiment of the present invention;

FIG. 4 is a schematic diagram of an anomaly detection flow of an anomaly access data detection model in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the components of a device for detecting abnormal access data in accordance with an embodiment of the present invention;

FIG. 6 is an exemplary system architecture diagram to which embodiments in accordance with the present invention may be applied;

FIG. 7 is a schematic diagram of an electronic device for implementing a method for detecting abnormal access data in an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments of the present invention and the technical features in the embodiments may be combined with each other provided that they do not conflict.

FIG. 1 is a schematic diagram of main steps of a method for detecting abnormal access data according to an embodiment of the present invention.

As shown in FIG. 1, the method for detecting abnormal access data according to the embodiment of the present invention may specifically be performed according to the following steps:

Step S101: training the abnormal access data detection model according to a pre-established initial training set.

In the embodiment of the invention, the abnormal access data detection model may adopt a supervised learning algorithm such as random forest or logistic regression (LR). The initial training set used for the initial training of the abnormal access data detection model comprises abnormal access data serving as positive samples and normal access data serving as negative samples.

FIG. 2 is a diagram of normal access data and abnormal access data according to an embodiment of the present invention. As shown in FIG. 2, the normal traffic profile (i.e., the profile of the normal access data) can be obtained by collecting statistics on the format of the normal samples (i.e., the normal access data), so that abnormal access data that does not conform to the profile (i.e., the data marked with x) can be identified. As can be seen from FIG. 2, the character regularity of the abnormal access data is weaker than that of the normal access data.

In practical applications, the abnormal access data can be obtained in advance by manual labeling or similar means. Because the amount of pre-labeled abnormal access data is often small, the data required for model training needs to be obtained from a large amount of unlabeled access data in order to achieve a better training effect. Preferably, an existing unsupervised anomaly detection algorithm, such as a matrix decomposition algorithm or the isolation forest algorithm, can be used to examine the unlabeled access data, and the required data is extracted according to the detection result. In the model training and detection phases, the input for each access data may be drawn from a plurality of statistical features and/or a plurality of text content features. The statistical features may include one or more of the following: the number of parameters of the uniform resource locator (URL), the mean and variance of the lengths of the parameter values, the distribution of the parameter characters, the access frequency of the URL, and the like. The text content features may include one or more of the following: word frequency, syntactic consistency, information entropy, and the like. Since these feature extraction methods are known technology, they are not described again here.
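As a rough illustration of a few of the features listed above, the following Python sketch computes the parameter count, the mean and variance of the parameter value lengths, and the information entropy of the query string for one URL. The function names `shannon_entropy` and `extract_features` and the dictionary keys are hypothetical and are not taken from the embodiment; the choice of population variance and of computing entropy over the raw query string are likewise assumptions.

```python
import math
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def shannon_entropy(text):
    """Information entropy (bits per character) of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(url):
    """Compute a few illustrative features of one access URL (hypothetical keys)."""
    query = urlsplit(url).query
    # keep_blank_values=True so parameters with empty values are still counted.
    params = parse_qsl(query, keep_blank_values=True)
    lengths = [len(value) for _, value in params]
    n = len(lengths)
    mean = sum(lengths) / n if n else 0.0
    # Population variance of the parameter value lengths (an assumption).
    variance = sum((x - mean) ** 2 for x in lengths) / n if n else 0.0
    return {
        "param_count": n,
        "value_len_mean": mean,
        "value_len_var": variance,
        "query_entropy": shannon_entropy(query),
    }
```

A query string with unusually long or high-entropy parameter values would stand out in these features, which is consistent with the observation that abnormal access data has weaker character regularity.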

For example, the isolation forest algorithm may be used to examine a plurality of unlabeled access data and obtain an anomaly score for each one (a higher anomaly score indicates that the access data is more likely to be abnormal). The unlabeled access data is then arranged in ascending order of anomaly score, and the first several access data (i.e., those with the smallest anomaly scores) are determined as normal access data. It will be appreciated that the normal access data determined in this way is only data that the isolation forest algorithm predicts to be normal with high likelihood, and is not necessarily objectively normal. The normal access data thus obtained and the pre-labeled abnormal access data can then be combined into an initial training set for the initial training of the abnormal access data detection model. With this arrangement, both labeled and unlabeled access data can be used effectively, and a training mechanism based on semi-supervised learning is realized.
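The selection of presumed-normal data can be sketched as follows. This minimal Python example assumes the anomaly scores have already been produced by some unsupervised detector (higher means more anomalous); it does not implement the isolation forest itself. The function names and the 0/1 label convention are hypothetical.

```python
def select_normal(samples, anomaly_scores, m):
    """Return the m samples with the smallest anomaly scores."""
    if len(samples) != len(anomaly_scores):
        raise ValueError("samples and anomaly_scores must have equal length")
    # Sort indices in ascending order of anomaly score (most normal first).
    order = sorted(range(len(samples)), key=lambda i: anomaly_scores[i])
    return [samples[i] for i in order[:m]]


def build_initial_training_set(normal_samples, labeled_abnormal):
    """Combine negatives (label 0) and positives (label 1) into one set."""
    return [(x, 0) for x in normal_samples] + [(x, 1) for x in labeled_abnormal]
```

The number m of samples to keep is a tuning choice that the text leaves open; a smaller m trades training set size for a lower risk of mislabeled negatives.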

As a preferred approach, the abnormal access data detection model may combine the gradient boosting decision tree (GBDT) algorithm with the logistic regression (LR) algorithm. During the initial training, a GBDT algorithm model is first trained in a supervised manner on the initial training set so as to combine effective anomaly detection features (each leaf node represents one combined feature). One-hot encoding is then applied to each leaf node of each GBDT tree, and finally the sparse features formed by the encoding are input into the LR algorithm model for training. The abnormal access data detection model thus has good nonlinear mapping capability and strong prediction traceability, and can also extract effective combined features. The GBDT leaf nodes may be one-hot encoded as follows: if the GBDT has two trees in total and the leaf nodes of the two trees are A, B, C, and D, the four leaf nodes can be converted into the following one-hot encoded data respectively: (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1). It should be noted that the above algorithm is only a preferred implementation of the abnormal access data detection model; in practice, the model may use other applicable supervised learning algorithms.
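The following Python sketch illustrates how the leaves reached by one sample could be turned into the sparse LR input. It assumes, for illustration only, that the first tree owns leaves A and B and the second tree owns leaves C and D (the text does not say how the four leaves are split between the trees). The function name `one_hot_leaves` and its parameters are hypothetical.

```python
def one_hot_leaves(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree.

    leaf_indices[t] is the 0-based index of the leaf that the sample
    reaches in tree t; leaves_per_tree[t] is that tree's leaf count.
    """
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("one leaf index is required per tree")
    encoded = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError("leaf index out of range")
        block = [0] * n_leaves
        block[leaf] = 1
        encoded.extend(block)
    return encoded
```

Under the stated assumption, a sample that reaches leaf A in the first tree and leaf D in the second would be encoded by `one_hot_leaves([0, 1], [2, 2])`, which sets exactly one position per tree.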

Step S102: after the training is finished, the abnormal probability of access data in the initial verification set established in advance is determined by using an abnormal access data detection model.

In this step, "the training" refers to the initial training performed in step S101. After the initial training is finished, the abnormal probability of the access data, namely the probability that the access data is abnormal access data, can be judged by using the abnormal access data detection model. In the embodiment of the invention, the abnormal probability of each access data in the initial verification set can be detected by using an abnormal access data detection model. The initial verification set is pre-established and is used for providing high-quality data to be added into the training set so as to realize dynamic updating of the training set, and finally an active learning training mechanism of the abnormal access data detection model is established. The initial verification set contains pre-marked abnormal access data and unmarked access data, and the initial verification set is mutually exclusive with the initial training set, namely the two sets do not have the same access data.

Step S103: marking a plurality of access data with the abnormal probability and the accuracy rate meeting preset discrimination conditions in the initial verification set as abnormal access data, and adding the abnormal access data into an initial training set; and training the abnormal access data detection model according to the current training set.

In this step, the accuracy refers to the proportion of pre-labeled abnormal access data among the access data selected from the verification set (which covers the initial verification set and the current verification set of each iterative training described later), and the current training set refers to the training set formed by adding data from the initial verification set to the initial training set. Model training based on the current training set is inheritance training: the object being trained is the model produced by the previous training process (that is, the model is warm-started), and the training fully inherits the data patterns and knowledge acquired by the previous training.

In a specific application, the discrimination conditions may be preset and the following steps are performed to extract the access data to be added to the initial training set from the initial verification set. It can be understood that the discrimination conditions embodied in the following steps are only preferable, and do not limit the setting of the discrimination conditions, and in a specific application scenario, the present invention can flexibly set the discrimination conditions to extract the required data.

1. Access data is selected from the initial verification set n times with replacement, in descending order of abnormal probability, using a preset initial number k and a preset increment number s, where k, s, and n are positive integers. That is, the access data in the initial verification set is arranged in descending order of abnormal probability and selected n times from front to back (from the largest abnormal probability to the smallest). Each selection is made with replacement, so each selection takes the access data with the largest abnormal probabilities in the initial verification set. The first selection takes k access data, and each subsequent selection takes s more than the previous one: the first selection takes k, the second takes k + s, and so on, and the n-th takes k + (n - 1)s.

2. The accuracy of the access data chosen in each of the n selections is calculated, and the selection with the highest accuracy is determined as the optimal selection. All access data of the optimal selection is labeled as abnormal data and added to the initial training set as positive samples for the next training; the access data of the optimal selection may then be removed from the initial verification set. In a specific application, if more than one of the n selections has the highest accuracy, the one among them containing the most access data may be determined as the optimal selection. Preferably, after the optimal selection is determined, it may further be judged whether the accuracy of its access data is less than a preset threshold: if so, the data of the optimal selection is of poor quality and the whole model training process ends; otherwise, the data of the optimal selection is high-quality data, which can be removed from the initial verification set and added to the initial training set.
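The two steps above can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the function names are hypothetical, the items are assumed to be hashable identifiers, and the tie-break (prefer the larger selection) follows one reading of the text.

```python
def candidate_selections(items, probabilities, k, s, n):
    """Return n candidate selections of the highest-probability items.

    The i-th selection (1-based) holds the top k + (i - 1) * s items.
    Selection is with replacement, so every candidate starts from the top
    of the same ranking and later candidates contain the earlier ones.
    """
    if min(k, s, n) < 1:
        raise ValueError("k, s and n must be positive integers")
    order = sorted(range(len(items)), key=lambda i: probabilities[i], reverse=True)
    ranked = [items[i] for i in order]
    return [ranked[: k + i * s] for i in range(n)]


def accuracy(selection, labeled_abnormal):
    """Fraction of the selection that was pre-labeled as abnormal."""
    if not selection:
        return 0.0
    hits = sum(1 for item in selection if item in labeled_abnormal)
    return hits / len(selection)


def pick_optimal(selections, labeled_abnormal, threshold):
    """Return (best_selection, best_accuracy), or (None, best_accuracy)
    when the best accuracy is below the threshold (training should stop)."""
    # Highest accuracy wins; ties go to the larger selection.
    best = max(selections, key=lambda sel: (accuracy(sel, labeled_abnormal), len(sel)))
    best_acc = accuracy(best, labeled_abnormal)
    if best_acc < threshold:
        return None, best_acc
    return best, best_acc
```

For example, with k = 1, s = 2, and n = 3, the three candidates contain the top 1, 3, and 5 items. If a candidate size exceeds the size of the verification set, the slice simply returns all remaining items.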

Through the above steps, the high-quality data in the initial verification set can be extracted into the initial training set for the next training, so that unlabeled access data is used effectively. By repeating these steps, that is, by performing at least one iterative training on the abnormal access data detection model with the continuously updated training set and verification set, a training mechanism in an active learning manner is realized. Iterative training refers to the inheritance training process carried out on a training set and a verification set that are continuously updated, and the initial training can be regarded as the first iterative training. It can be appreciated that the training set and the verification set used by the current iterative training are generated by the previous iterative training, and that the training set and the verification set required by the next iterative training are generated when the current iterative training finishes. The condition for stopping the iterative training is that the accuracy of the access data of the optimal selection is less than a preset threshold; at that point, the training of the abnormal access data detection model is complete, and the model can be used to detect abnormality in the access data to be detected.

In one embodiment, the steps of performing any one iterative training are as follows:

1. Performing inheritance training on the current abnormal access data detection model according to the current training set; after the training is finished, the abnormal probability of the access data in the current verification set is determined by using an abnormal access data detection model.

2. Labeling as abnormal access data a plurality of access data in the current verification set whose abnormal probability and accuracy meet the discrimination conditions, and adding them to the current training set to form the training set for the next iterative training; the plurality of access data is removed from the current verification set to form the verification set for the next iterative training. In one embodiment, access data is first selected from the current verification set n times with replacement, in descending order of abnormal probability and according to the initial quantity and the increment quantity. The accuracy of the access data chosen in each of the n selections is then calculated, the selection with the highest accuracy is determined as the optimal selection, and it is judged whether the accuracy of the data of the optimal selection is less than a preset threshold: if so, the whole model training process ends; otherwise, the access data of the optimal selection is removed from the current verification set and added to the current training set as positive samples, and the updated training set and verification set are used for the next iterative training.
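The iterative procedure described above can be sketched as a single loop. In this minimal Python example the model is abstracted behind two caller-supplied functions, `fit` and `predict`, which are hypothetical placeholders for a warm-startable learner; the `max_rounds` safety cap is also an assumption and does not appear in the text.

```python
def iterative_training(train_pos, train_neg, validation, labeled_abnormal,
                       fit, predict, k, s, n, threshold, max_rounds=100):
    """Sketch of the active-learning loop.

    fit(positives, negatives, previous_model) -> model   (warm start)
    predict(model, item) -> anomaly probability in [0, 1]
    Returns (model, positives, remaining_validation).
    """
    positives = list(train_pos)
    remaining = list(validation)
    model = None
    for _ in range(max_rounds):
        # Inherit the previous model instead of training from scratch.
        model = fit(positives, train_neg, model)
        if not remaining:
            break
        ranked = sorted(remaining, key=lambda x: predict(model, x), reverse=True)
        # n selections with replacement: top k, top k+s, ..., top k+(n-1)s.
        candidates = [ranked[: k + i * s] for i in range(n)]

        def acc(sel):
            return sum(1 for x in sel if x in labeled_abnormal) / len(sel)

        # Highest accuracy wins; ties go to the larger selection.
        best = max(candidates, key=lambda sel: (acc(sel), len(sel)))
        if acc(best) < threshold:
            break  # stop condition: best selection is not accurate enough
        positives.extend(best)
        chosen = set(best)
        remaining = [x for x in remaining if x not in chosen]
    return model, positives, remaining
```

Each pass trains first and selects afterwards, so when the stop condition is met the returned model has already been trained on the final training set.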

FIG. 3 is a schematic diagram of a training process of an abnormal access data detection model in an embodiment of the present invention. In the embodiment of the present invention, as shown in FIG. 3, normal access data is first determined from unlabeled access data by using an isolation forest; an initial training set composed of the normal access data and pre-labeled abnormal access data is then used to train the abnormal access data detection model, and access data in the current verification set is selected multiple times. After the optimal selection is determined, it is judged whether the accuracy of its access data is less than a preset threshold: if so, training stops; otherwise, the access data of the optimal selection is removed from the current verification set and added to the current training set for the next iterative training. As long as the stop condition (i.e., the accuracy of the data of the optimal selection is less than the preset threshold) has not been reached, the iterative training process can continue so as to keep improving the abnormality detection accuracy of the model.

Step S104: inputting the access data to be detected into the trained abnormal access data detection model, and judging from the output result whether the access data to be detected is abnormal access data.

It will be appreciated that the training completion in this step refers to the completion of the entire iterative training process, at which time the abnormal access data detection model can be applied to the detection of abnormal access data in actual operation.

FIG. 4 is a schematic diagram of an anomaly detection flow of an abnormal access data detection model in an embodiment of the present invention. As shown in FIG. 4, when abnormality detection is performed on access data to be detected, the access data is first input into the GBDT model; the leaf nodes reached by the access data are then one-hot encoded and input into the LR model to obtain an output result (namely, the abnormal probability). If the abnormal probability is greater than a preset critical value, the access data to be detected is determined to be abnormal access data; if the abnormal probability is less than or equal to the critical value, it is determined to be normal access data. It should be noted that the critical value may be the minimum abnormal probability among the access data of the last optimal selection whose accuracy in the iterative training was not less than the preset threshold. For example, if the abnormal access data detection model reaches the stop condition after three iterative trainings (including the initial training), the optimal selection made in the second iterative training is the last optimal selection whose accuracy is not less than the preset threshold; if the minimum abnormal probability among the access data of that optimal selection is 0.4, then 0.4 may be used as the critical value. It is understood that the combined GBDT and LR model described above is only exemplary, and that other suitable models may be used as the abnormal access data detection model in particular applications.
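The final decision rule can be sketched as follows. This minimal Python example assumes the abnormal probability has already been produced by the trained model; the function names are hypothetical.

```python
def critical_value(last_optimal_probabilities):
    """Smallest abnormal probability within the last accepted optimal selection."""
    return min(last_optimal_probabilities)


def is_abnormal(anomaly_probability, critical):
    """Abnormal only when the probability strictly exceeds the critical value."""
    return anomaly_probability > critical
```

With the figures from the example above, a last accepted optimal selection whose smallest abnormal probability is 0.4 gives a critical value of 0.4, so access data scoring exactly 0.4 is treated as normal.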

In the technical scheme of the embodiment of the invention, normal access data is determined from unlabeled access data by using an unsupervised learning model, and the normal access data and the pre-labeled abnormal access data are combined into the initial training set for iterative training of the abnormal access data detection model. This realizes a semi-supervised learning mechanism that makes effective use of both labeled and unlabeled access data. In addition, the dynamically adjusted training set and verification set are used to iteratively train the model through multiple rounds of active learning. In each round, the model trained on the current training set determines the abnormal probability of the access data in the current verification set; several selections are made of the access data with the highest abnormal probability, and the selection with the highest accuracy is determined as the optimal selection. When the accuracy of the optimal selection is not less than a preset threshold, the optimally selected access data is removed from the current verification set and added to the current training set for the next round. In the next round, the previously trained model can be inherited, which avoids the loss of learned knowledge caused by the existing cold-start approach. When the accuracy of the optimal selection is less than the preset threshold, iterative training stops, completing the training process of the model. With this arrangement, while the quality of the training data is ensured, high-quality unlabeled access data can be extracted from the initial verification set to the greatest possible extent for iterative training, thereby improving the anomaly detection accuracy of the model.
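The iterative training process summarized above can be sketched roughly as the following loop. This is only an illustration under stated assumptions: `iterative_training`, `fit`, `score` and `select` are hypothetical names, the model is a toy stand-in rather than a GBDT and LR combination, and the selection rule is passed in as a callable so that the loop does not depend on how candidates are chosen.

```python
def iterative_training(train_set, verification_ids, labelled_abnormal,
                       fit, score, select, threshold):
    """Run training rounds until the best selection falls below the threshold."""
    train = dict(train_set)          # id -> label (1 abnormal, 0 normal)
    pending = set(verification_ids)  # current verification set
    model, critical_value = None, None
    while True:
        model = fit(model, train)    # the previous model is passed in (no cold start)
        if not pending:
            break
        scored = score(model, pending)
        chosen, accuracy = select(scored, labelled_abnormal)
        if not chosen or accuracy < threshold:
            break                    # stopping condition reached
        # Lowest probability in the last accepted selection.
        critical_value = min(scored[i] for i in chosen)
        for i in chosen:
            train[i] = 1             # label as abnormal and add to the training set
            pending.discard(i)       # remove from the verification set
    return model, train, pending, critical_value

# Toy stand-ins (hypothetical): fixed scores and a "model" that only counts rounds.
FIXED_SCORES = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.3}

def fit(model, train):
    rounds = 0 if model is None else model["rounds"]
    return {"rounds": rounds + 1}

def score(model, ids):
    return {i: FIXED_SCORES[i] for i in ids}

def select(scored, labelled_abnormal):
    top = sorted(scored, key=scored.get, reverse=True)[:2]
    return top, sum(i in labelled_abnormal for i in top) / len(top)

model, train, pending, critical = iterative_training(
    {"x": 1, "n": 0}, ["a", "b", "c", "d"], {"a", "b", "c"},
    fit, score, select, threshold=0.8)
print(model, sorted(pending), critical)
```

In this toy run the first round accepts "a" and "b" (accuracy 1.0), and the second round stops because the remaining candidates reach an accuracy of only 0.5; the value tracked as `critical_value` corresponds to the critical value described for the detection step.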

It should be noted that, for the convenience of description, the foregoing method embodiments are expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the described order of actions, and some steps may actually be performed in other order or simultaneously. Moreover, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts and modules referred to are not necessarily required to practice the invention.

In order to facilitate better implementation of the above-described aspects of embodiments of the present invention, the following provides related devices for implementing the above-described aspects.

Referring to fig. 5, an apparatus 500 for detecting abnormal access data according to an embodiment of the present invention may include: an initial training unit 501, a probability determination unit 502, an active learning unit 503, and an abnormality detection unit 504.

The initial training unit 501 may be configured to train the abnormal access data detection model according to a pre-established initial training set.

The probability determination unit 502 may be configured to: after the training is finished, determine the abnormal probability of the access data in a pre-established initial verification set by using the abnormal access data detection model; the initial verification set comprises pre-labeled abnormal access data and unlabeled access data.

The active learning unit 503 may be configured to: label, as abnormal access data, a plurality of access data in the initial verification set whose abnormal probability and accuracy meet preset discrimination conditions, and add the plurality of access data to the initial training set; and train the abnormal access data detection model according to the current training set; the accuracy is the proportion of pre-labeled abnormal access data among the plurality of access data.

The anomaly detection unit 504 may be configured to input the access data to be detected into the trained anomaly access data detection model, and determine whether the access data to be detected is the anomaly access data according to the output result.

In some embodiments, the device 500 may further comprise an iterative training unit configured to: remove the plurality of access data from the initial verification set after the plurality of access data is added to the initial training set; and perform at least one round of iterative training on the abnormal access data detection model; wherein, in any one round of iterative training: the abnormal access data detection model is trained according to the current training set; after the training is finished, the abnormal probability of the access data in the current verification set is determined by using the abnormal access data detection model; a plurality of access data in the current verification set whose abnormal probability and accuracy meet the discrimination conditions are labeled as abnormal access data and added to the current training set to form the training set of the next round; and the plurality of access data is removed from the current verification set to form the verification set of the next round.

In a specific application, in any one round of iterative training, the discrimination conditions include: selecting access data from the current verification set n times with replacement, in descending order of abnormal probability and according to a preset initial quantity and an increment quantity, where n is a positive integer; and determining the selection with the highest accuracy among the n selections as the optimal selection, and labeling the access data of the optimal selection as abnormal access data.
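As a rough illustration of this selection rule, the sketch below assumes that the i-th selection (counting from zero) takes the `initial + i * increment` access data with the highest abnormal probability, and that, because selection is with replacement, every selection starts again from the full ranking. The function name `best_selection`, the tie-breaking rule (the earlier, smaller selection wins), and the sample values are assumptions made for the example.

```python
def best_selection(scored, labelled_abnormal, initial, increment, n):
    """Return (selected_ids, accuracy) for the most accurate of n top-k selections.

    scored: dict mapping id -> abnormal probability for the current verification set
    labelled_abnormal: set of ids that were labeled abnormal in advance
    """
    ranked = sorted(scored, key=scored.get, reverse=True)
    best_ids, best_acc = [], 0.0
    for i in range(n):
        k = min(initial + i * increment, len(ranked))
        if k == 0:
            break
        chosen = ranked[:k]  # with replacement: restart from the full ranking
        # Accuracy: share of pre-labeled abnormal data among the chosen items.
        acc = sum(1 for x in chosen if x in labelled_abnormal) / k
        if not best_ids or acc > best_acc:
            best_ids, best_acc = chosen, acc
    return best_ids, best_acc

# Hypothetical verification set with six access records.
scored = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.2}
labelled = {"a", "c", "d"}
ids, acc = best_selection(scored, labelled, initial=2, increment=1, n=3)
print(ids, acc)
```

Here the three selections contain 2, 3 and 4 records with accuracies of 1/2, 2/3 and 3/4, so the third selection is the optimal one.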

As a preferred aspect, in any one of the iterative training, the discrimination conditions may further include: the accuracy of the optimally selected access data is not less than a preset threshold value; the iterative training unit may be further configured to: in any one iterative training, if the abnormal probability and accuracy of the access data in the current verification set do not meet the discrimination conditions, stopping the iterative training.

Preferably, in an embodiment of the present invention, the initial training set includes: pre-labeled abnormal access data, and normal access data determined from unlabeled access data by using a pre-established unsupervised learning model; the initial training set and the initial verification set are mutually exclusive; the abnormal access data detection model may be a combination of a gradient boosting decision tree (GBDT) model and a logistic regression (LR) model; in any one round of iterative training, the leaf nodes of the GBDT model are one-hot encoded and then used as input data of the LR model.

Fig. 6 illustrates an exemplary system architecture 600 to which the method of detecting abnormal access data or the apparatus for detecting abnormal access data of the embodiment of the present invention can be applied.

As shown in fig. 6, a system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605 (this architecture is merely an example, and the components contained in a particular architecture may be tailored to the application specific case). The network 604 is used as a medium to provide communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 605 via the network 604 using the terminal devices 601, 602, 603 to receive or send messages, etc. Various communication client applications such as an abnormality detection class application, a web browser application, etc. (for example only) may be installed on the terminal devices 601, 602, 603.

The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 605 may be a server providing various services, such as a background server (by way of example only) providing support for anomaly detection class applications operated by users using the terminal devices 601, 602, 603. The background server may process a received abnormal access data detection request and feed back the processing result (for example, the detected abnormal access data; by way of example only) to the terminal devices 601, 602, 603.

It should be noted that, the method for detecting abnormal access data provided in the embodiment of the present invention is generally executed by the server 605, and accordingly, the device for detecting abnormal access data is generally disposed in the server 605.

It should be understood that the number of terminal devices, networks and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The invention also provides electronic equipment. The electronic equipment of the embodiment of the invention comprises: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the detection method of the abnormal access data.

Referring now to FIG. 7, there is illustrated a schematic diagram of a computer system 700 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 7 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the invention.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM703, various programs and data required for the operation of the computer system 700 are also stored. The CPU701, ROM 702, and RAM703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output portion 707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. The drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.

In particular, the processes described in the main step diagrams above may be implemented as computer software programs according to the disclosed embodiments of the invention. For example, embodiments of the present invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the main step diagrams. In the above-described embodiment, the computer program can be downloaded and installed from a network through the communication section 709 and/or installed from the removable medium 711. The above-described functions defined in the system of the present invention are performed when the computer program is executed by the central processing unit 701.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present invention may be implemented in software or in hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an initial training unit, a probability determination unit, an active learning unit, and an anomaly detection unit. The names of the units do not, in some cases, limit the units themselves; for example, the initial training unit may also be described as "a unit that provides the probability determination unit with the initially trained abnormal access data detection model".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by the device, cause the device to perform steps comprising: training the abnormal access data detection model according to a pre-established initial training set; after the training is finished, determining the abnormal probability of access data in the initial verification set established in advance by using an abnormal access data detection model; the initial verification set comprises pre-marked abnormal access data and unmarked access data; marking a plurality of access data with the abnormal probability and the accuracy rate meeting preset discrimination conditions in the initial verification set as abnormal access data, and adding the abnormal access data into an initial training set; training the abnormal access data detection model according to the current training set; the accuracy is the proportion of abnormal access data marked in advance in the plurality of access data; and inputting the access data to be detected into the abnormal access data detection model after training is completed, and judging whether the access data to be detected is the abnormal access data or not according to an output result.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for detecting abnormal access data, comprising:

training the abnormal access data detection model according to a pre-established initial training set;

After the training is finished, determining the abnormal probability of access data in the initial verification set established in advance by using an abnormal access data detection model; the initial verification set comprises pre-marked abnormal access data and unmarked access data;

Marking a plurality of access data with the abnormal probability and the accuracy rate meeting preset discrimination conditions in the initial verification set as abnormal access data, and adding the abnormal access data into an initial training set; training the abnormal access data detection model according to the current training set; the accuracy is the proportion of abnormal access data marked in advance in the plurality of access data; and

inputting the access data to be detected into the abnormal access data detection model after training is completed, and judging whether the access data to be detected is abnormal access data according to an output result.

2. The method according to claim 1, wherein the method further comprises:

Removing the plurality of access data from the initial verification set after adding the plurality of access data to the initial training set; performing at least one iteration training on the abnormal access data detection model; wherein, in any one iterative training:

training the abnormal access data detection model according to the current training set; after the training is finished, determining the abnormal probability of the access data in the current verification set by using an abnormal access data detection model;

marking a plurality of access data in the current verification set whose abnormal probability and accuracy meet the discrimination conditions as abnormal access data, and adding the plurality of access data to the current training set to form a training set of the next iterative training; and removing the plurality of access data from the current verification set to form a verification set of the next iterative training.

3. The method of claim 2, wherein in any one iterative training, the discrimination conditions include:

selecting access data from the current verification set n times with replacement, in descending order of abnormal probability and according to a preset initial quantity and an increment quantity; wherein n is a positive integer;

and determining the selection with the highest accuracy rate in the n selections as the optimal selection, and marking the access data of the optimal selection as abnormal access data.

4. A method according to claim 3, wherein in any one iterative training, the discrimination conditions further comprise: the accuracy of the optimally selected access data is not less than a preset threshold value; and, the method further comprises:

in any one iterative training, if the abnormal probability and accuracy of the access data in the current verification set do not meet the discrimination conditions, stopping the iterative training.

5. The method according to any one of claims 1 to 4, wherein,

the initial training set comprises: pre-labeled abnormal access data, and normal access data determined from unlabeled access data by using a pre-established unsupervised learning model; the initial training set and the initial verification set are mutually exclusive; and

the abnormal access data detection model is a combination of a gradient boosting decision tree (GBDT) model and a logistic regression (LR) model; in any one iterative training, the leaf nodes of the GBDT model are one-hot encoded and then used as input data of the LR model.

6. A detection apparatus for abnormal access data, comprising:

The initial training unit is used for training the abnormal access data detection model according to a pre-established initial training set;

A probability determination unit for: after the training is finished, determining the abnormal probability of access data in the initial verification set established in advance by using an abnormal access data detection model; the initial verification set comprises pre-marked abnormal access data and unmarked access data;

the active learning unit is used for marking a plurality of access data with the abnormal probability and the accuracy rate in the initial verification set meeting the preset discrimination conditions as abnormal access data and adding the abnormal access data into the initial training set; training the abnormal access data detection model according to the current training set; the accuracy is the proportion of abnormal access data marked in advance in the plurality of access data; and

The abnormal detection unit is used for inputting the access data to be detected into the trained abnormal access data detection model, and judging whether the access data to be detected is the abnormal access data or not according to the output result.

7. The apparatus of claim 6, wherein the apparatus further comprises:

an iterative training unit, configured to: remove the plurality of access data from the initial verification set after the plurality of access data is added to the initial training set; and perform at least one iterative training on the abnormal access data detection model; wherein, in any one iterative training: the abnormal access data detection model is trained according to the current training set; after the training is finished, the abnormal probability of the access data in the current verification set is determined by using the abnormal access data detection model; a plurality of access data in the current verification set whose abnormal probability and accuracy meet the discrimination conditions are marked as abnormal access data and added to the current training set to form a training set of the next iterative training; and the plurality of access data is removed from the current verification set to form a verification set of the next iterative training.

8. The apparatus of claim 7, wherein:

in any one iterative training, the discrimination conditions include: selecting access data from the current verification set n times with replacement, in descending order of abnormal probability and according to a preset initial quantity and an increment quantity, wherein n is a positive integer; and determining the selection with the highest accuracy among the n selections as the optimal selection, and marking the access data of the optimal selection as abnormal access data;

in any one iterative training, the discrimination conditions further include: the accuracy of the optimally selected access data is not less than a preset threshold value; and

the iterative training unit is further configured to: in any one iterative training, if the abnormal probability and accuracy of the access data in the current verification set do not meet the discrimination conditions, stop the iterative training.

9. An electronic device, comprising:

one or more processors; and

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.