CN112635031B

CN112635031B - Data volume anomaly detection method, device, storage medium and equipment

Info

Publication number: CN112635031B
Application number: CN202011478233.4A
Authority: CN
Inventors: 许朝
Original assignee: Beijing Yiyiyun Technology Co ltd
Current assignee: Beijing Yiyiyun Technology Co ltd
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2023-08-29
Anticipated expiration: 2040-12-15
Also published as: CN112635031A

Abstract

The application discloses a method for detecting anomalies in data volume. For a service type, sample data of N batches is collected, where each batch has duration T and the sample data of each batch comprises the data volume of the base table and the data volumes of all non-base tables; N is a positive integer and T is greater than zero. For any non-base table, the coefficient of the non-base table for each batch is computed from the data volume of the non-base table and the data volume of the base table; the maximum coefficient and the minimum coefficient of the non-base table are determined from the coefficients of the N batches; the predicted maximum data volume and the predicted minimum data volume of the non-base table for the (N+1)th batch are calculated from the maximum coefficient and the minimum coefficient; and whether the data volume of the non-base table in the (N+1)th batch is abnormal is detected according to the predicted maximum data volume and the predicted minimum data volume.

Description

Data volume anomaly detection method, device, storage medium and equipment

Technical Field

The present application relates to data processing technologies, and in particular, to a method, an apparatus, a storage medium, and a device for detecting anomalies in data volumes.

Background

In the field of medical data management, data is collected by a client deployed at a hospital (referred to below simply as the hospital), and the hospital uploads the data to a system where it is managed. The data volume uploaded by each hospital department therefore needs to be monitored and checked for plausibility, so that medical data is transmitted stably, without loss or duplication.

Traditional medical data quality control mainly takes three forms: quality control based on manual experience, prediction indices provided by hospital-side vendors, and approaches that rely entirely on artificial intelligence. However, experience-based manual quality control suffers from high cost, vague standards that cannot be quantified, and low precision; prediction indices provided by hospital-side vendors suffer from low precision, among other problems; and relying entirely on artificial intelligence imposes high technical requirements and a high technical investment.

Therefore, a data quality control mode which is simple and easy to maintain, low in cost and high in quality control precision is needed.

Disclosure of Invention

The application provides a data volume anomaly detection method and device, which at least solve the technical problems in the prior art.

In one aspect, the present application provides a method for detecting anomalies in a data volume, the method being applied to a data system comprising at least one service type, each service type having a base table and at least one non-base table, the method comprising:

for a business type, collecting sample data of N batches, wherein the duration of each batch is T, and the sample data of each batch comprises the data volume of a basic table and the data volumes of all non-basic tables; the N is a positive integer, and the T is larger than zero;

for any non-basic table, according to the data quantity of the non-basic table and the data quantity of the basic table, counting the coefficient of the non-basic table corresponding to each batch;

according to the coefficients of N batches corresponding to the non-basic table, calculating the maximum coefficient and the minimum coefficient corresponding to the non-basic table;

calculating the predicted maximum data volume and the predicted minimum data volume of the non-base table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient;

and detecting whether the data volume of the non-base table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.

Wherein, in the basic table and the non-basic table, each table contains at least one record, and the data quantity is the record number contained in the table;

the basic table is used for recording basic data of users, and each record corresponds to a unique user identifier; after the user's basic data is generated, the non-basic table is used to record the user's generated associated data.

Wherein, according to the data amount of the non-basic table and the data amount of the basic table, the statistics of the coefficient of the non-basic table corresponding to each batch includes:

for any batch, the coefficient of the non-base table for that batch is the ratio of the data volume of the non-base table in that batch to the data volume of the base table in that batch.

Wherein calculating the predicted maximum data volume and the predicted minimum data volume of the non-base table for the (N+1)th batch comprises:

collecting the data volume of the base table for the (N+1)th batch;

subtracting the data volume of the base table for the Nth batch from the data volume of the base table for the (N+1)th batch to obtain a user increment;

the predicted maximum data volume of the non-base table for the (N+1)th batch is: the data volume of the non-base table in the Nth batch + the user increment × the maximum coefficient;

the predicted minimum data volume of the non-base table for the (N+1)th batch is: the data volume of the non-base table in the Nth batch + the user increment × the minimum coefficient.

Wherein detecting whether the data volume of the non-base table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume comprises:

if the data volume of the non-base table in the (N+1)th batch is greater than or equal to the predicted minimum data volume and less than or equal to the predicted maximum data volume, determining that the data volume of the non-base table in the (N+1)th batch is normal; otherwise determining that it is abnormal.
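The coefficient, prediction and detection steps can be written compactly as follows. The symbols are introduced here only for illustration and do not appear in the original text: b_i and y_i denote the data volumes of the base table and of a given non-base table in batch i, c_i is the coefficient for batch i, and ΔU is the user increment. The form of the two prediction formulas follows the worked example given later in the detailed description.

```latex
\begin{aligned}
c_i &= \frac{y_i}{b_i}, \qquad i = 1, \dots, N \\
c_{\max} &= \max_{1 \le i \le N} c_i, \qquad c_{\min} = \min_{1 \le i \le N} c_i \\
\Delta U &= b_{N+1} - b_N \\
\hat{y}^{\max}_{N+1} &= y_N + \Delta U \cdot c_{\max}, \qquad
\hat{y}^{\min}_{N+1} = y_N + \Delta U \cdot c_{\min} \\
y_{N+1} \text{ is normal} &\iff \hat{y}^{\min}_{N+1} \le y_{N+1} \le \hat{y}^{\max}_{N+1}
\end{aligned}
```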

Wherein, the collected sample data of the N batches does not contain the data which is detected as abnormal.

If the detection result is wrong, the method further comprises adjusting the value of N for a batch to be detected, including:

collecting sample data through a time window, where the initial length of the time window is M batches, the start position of the time window is the batch immediately preceding the batch to be detected, and the end position of the time window is the Mth batch before the batch to be detected; each time sample data is collected, the start position of the time window is unchanged and the end position is moved P batches earlier than in the previous collection; sample data is collected through the time window a predetermined number of times;

and calculating the error percentage corresponding to the sample data of each collection, and taking the number of batches of the sample data whose error percentage has the smallest absolute value as the value of N.

Calculating the error percentage corresponding to the sample data of each collection through the time window comprises the following steps:

for the sample data of any one collection, calculating the coefficient of the non-base table to be detected for each batch in that sample data, and computing the average coefficient of the non-base table;

calculating the predicted average data volume of the non-base table to be detected for the batch to be detected as: the data volume of the non-base table to be detected in the previous batch + the user increment × the average coefficient;

calculating the error percentage of the non-base table to be detected for that sample data as: the ratio of the predicted average data volume to the data volume of the non-base table in the batch to be detected, minus 1.
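In symbols, the selection of N described above can be summarized as follows. The notation is an illustrative assumption rather than part of the original text: t is the batch to be detected, L is the length of the time window used in one collection, c_i = y_i / b_i is the coefficient of batch i, and ΔU_t = b_t - b_{t-1} is the user increment.

```latex
\begin{aligned}
\bar{c}_L &= \frac{1}{L} \sum_{i=t-L}^{t-1} c_i \\
\hat{y}_t(L) &= y_{t-1} + \Delta U_t \cdot \bar{c}_L \\
e_L &= \frac{\hat{y}_t(L)}{y_t} - 1 \\
N &= \arg\min_{L \in \{M,\, M+P,\, M+2P,\, \dots\}} \lvert e_L \rvert
\end{aligned}
```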

In another aspect, the present application provides an anomaly detection device for data volume, the device being applied to a data system, the data system including at least one service type, each service type having a base table and at least one non-base table, the device comprising:

the collection module is used for collecting sample data of N batches aiming at a service type, the duration of each batch is T, and the sample data of each batch comprises the data volume of a basic table and the data volumes of all non-basic tables;

the calculation module is used for counting the coefficient of the non-basic table corresponding to each batch according to the data quantity of the non-basic table and the data quantity of the basic table aiming at any non-basic table; according to the coefficients of N batches corresponding to the non-basic table, calculating the maximum coefficient and the minimum coefficient corresponding to the non-basic table;

the prediction module is used for calculating the predicted maximum data volume and the predicted minimum data volume of the non-base table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient;

and the detection module is used for detecting whether the data volume of the non-base table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.

In yet another aspect, the present application provides an apparatus comprising:

one or more processors;

a storage means for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the anomaly detection method for the data volume described in any one of the above.

Still another aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described anomaly detection methods for data volume.

In the above method, detection of the data volume of the current batch depends on the data volumes of the preceding N batches. The method computes indices, namely the predicted maximum data volume and the predicted minimum data volume, from the data volumes of those N batches and uses them to detect whether the data volume of the current batch is abnormal. The detection process does not depend on human experience, requires no vendor-provided indices, and involves no complex artificial intelligence algorithm, thereby freeing up labor, improving quality control precision and the degree of automation, and lowering technical requirements and technical investment.

Drawings

FIG. 1 is a flow chart of a method for detecting anomalies in data amounts according to an embodiment of the present application;

FIG. 2 is a flow chart of a method for detecting anomalies in data amounts according to another embodiment of the present application;

FIG. 3 is a schematic diagram showing a structure of an anomaly detection device for data volume according to an embodiment of the present application;

Detailed Description

In order to make the objects, features and advantages of the present application more comprehensible, the technical solutions according to the embodiments of the present application will be clearly described in the following with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In the field of medical data management, existing data quality inspection approaches roughly include the following:

Manual experience-based quality control: drawing on experience gained from long-term tracking, quality control personnel perform targeted quality control on each data table of each medical institution, based on the size of the data volume increment, trend changes, characteristics of the medical institution (for example, the data volume growth rates of the tables of a general hospital are relatively similar, whereas for a specialized hospital the increments of individual tables differ markedly from those of other tables), and other special data phenomena.

Prediction indices provided by hospital-side vendors: the vendor provides a prediction index for the data volume of each table; quality control personnel compare the actual data volume with the index, and a difference exceeding a threshold is marked as abnormal.

Relying entirely on artificial intelligence: the data volume is predicted accurately using deep learning techniques such as neural networks, so that fully automatic quality control can be achieved.

However, the three methods have problems, for example:

Manual experience-based quality control has the following problems:

1. High quality control cost. Quality control personnel are limited; performing quality control on all key tables of many medical institutions means more quality control time or more personnel. In experience, incremental quality control for 20 medical institutions takes 2 to 3 days, and if data is updated frequently the demand on personnel is higher still. In addition, manual experience-based quality control depends heavily on the data of the previous batch; if the previous batch has problems, quality control loses its "fulcrum";

2. Vague, unquantifiable standards. The dividing line between normal and abnormal data growth is left entirely to the judgment of the quality control personnel, so the result depends heavily on their experience, and a change of personnel changes the quality control standard, leading to inconsistent standards;

3. Low quality control precision. Experience-based quality control is sensitive to the time interval between data updates. If a batch is not a complete one (for example, updates are scheduled every 7 days, but in practice are often early or late because of holidays, technical problems, and so on), this poses many challenges to the experience of the quality control personnel; moreover, if the batch contains factors such as holidays or disease outbreaks, the data volume is strongly affected, which in turn affects quality control precision.

Prediction indices provided by hospital-side vendors have the following problems:

1. The "index" provided by the vendor is often one whose "principle" is unclear or undisclosed, so that errors cannot be corrected when they occur;

2. The vendor still faces the following problem when determining the index: whether the index is determined manually or by artificial intelligence, the same cost and precision problems arise, so the indices provided by vendors are generally inaccurate and cannot be optimized.

The way of completely resorting to artificial intelligence has the following problems:

1. High technical requirements. Medical data is not updated in batches at equal intervals, so the task is not a simple time-based prediction model; many influencing factors such as holidays and diseases must be considered, and acquiring and using these factors presents certain difficulties.

2. High investment cost. The objects of quality control are the individual tables of each medical institution, which means that each table of each medical institution needs its own model, together with later maintenance and updates, so development and maintenance costs are very high.

To this end, the embodiments of the application provide a method for detecting anomalies in data volume, which can at least solve the above problems encountered in medical data quality inspection, freeing up labor, improving quality control precision and the degree of automation, lowering technical requirements and technical investment, and improving the portability of the solution. The method is, of course, not limited to quality inspection of medical data; it is also suitable for traffic data and other data with characteristics similar to those of medical data.

Fig. 1 shows a method for detecting anomalies in data volume according to an embodiment of the present application, which is applied to a data system including at least one service type, each service type having a base table and at least one non-base table. The table under each service type is detected independently, and the detection method of the table under each service type is the same, and the method comprises the following steps of:

step 101, collecting sample data of N batches, wherein the duration of each batch is T, and the sample data of each batch comprises the data volume of a basic table and the data volume of at least one non-basic table; and N is a positive integer, and T is greater than zero.

In the basic table and the non-basic table, each table contains at least one record, and in this embodiment, the data amount of each table refers to the number of records contained in the table. The basic table is used for recording basic data of a user, taking medical data as an example, if the service type is an outpatient service, the basic table can be a registration table, each record in the registration table corresponds to registration information of a patient, namely, basic data (including an outpatient service number, a name, an age, an identity card, a department and the like), and each record corresponds to a unique user identification, such as an outpatient service number; after the user's basic data is generated, the non-basic table is used to record the user-generated associated data, e.g., after patient registration, order information, medication injection information, etc., may be generated, which is recorded by the corresponding non-basic table, each record in the non-basic table also being associated with a unique user identification, i.e., an outpatient number.

It should be noted that the same user may correspond to different user identifications in different service types, for example, when the service type is hospitalization, a hospitalization number may be assigned to the patient, which is different from the clinic number. The clinic number is used for identifying the data of the clinic with the business type generated by the user, and the hospitalization number is used for identifying the data of the hospitalization with the business type generated by the user.

In addition, some tables in the data system record data of several service types at once. For example, both the outpatient service and the inpatient service generate medical order information, so the medical order table contains order records of both. When sample data of one service type is collected in this step, the records in such a table that match the service type must therefore be screened out to form the sample data of that table under the service type. For example, when sample data of the outpatient service is collected, the order records of the outpatient type are screened out of the medical order table to form the outpatient medical order table used in subsequent detection.
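The screening step can be sketched as follows. This is a minimal illustration, assuming each record of a shared table carries a service-type field and a batch label; the field names `service_type` and `batch` and the function `count_by_batch` are hypothetical and not part of the original text.

```python
from collections import Counter


def count_by_batch(records, service_type):
    """Count the records of one service type in each batch.

    `records` is an iterable of dicts with the hypothetical keys
    'service_type' and 'batch' (e.g. a 'YYYY-MM' label when T is one month).
    """
    counts = Counter()
    for rec in records:
        # Keep only records that belong to the requested service type.
        if rec["service_type"] == service_type:
            counts[rec["batch"]] += 1
    return dict(counts)


# Hypothetical medical order records shared by outpatient and inpatient services.
orders = [
    {"service_type": "outpatient", "batch": "2020-06"},
    {"service_type": "inpatient", "batch": "2020-06"},
    {"service_type": "outpatient", "batch": "2020-06"},
    {"service_type": "outpatient", "batch": "2020-07"},
]
outpatient_order_counts = count_by_batch(orders, "outpatient")
```

The resulting per-batch counts play the role of the data volume of the outpatient medical order table in the later steps.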

In this embodiment, N batches of data may be collected at a time, where each batch covers a duration T; for example, if T is one month and N is 6, data for 6 months is collected. The sample data includes the data volume of one base table and the data volume of at least one non-base table; if the detection target is a particular non-base table, only the data volumes of the base table and of that non-base table need be collected. For example, the monthly data volume of the registration table and the monthly data volume of the drug infusion table are collected for months 1 to 6.

Step 102, for any non-base table, counting the coefficient of the non-base table corresponding to each batch according to the data amount of the non-base table and the data amount of the base table.

For example, for the data volumes collected for months 1 to 6, the coefficient of the drug infusion table for the first batch (month 1) is: data volume of the drug infusion table in month 1 / data volume of the registration table in month 1; the coefficient for the second batch (month 2) is: data volume of the drug infusion table in month 2 / data volume of the registration table in month 2; and so on, giving the coefficients of the drug infusion table for all 6 batches.

Step 103, according to the coefficients of the N batches corresponding to the non-basic table, calculating the maximum coefficient and the minimum coefficient corresponding to the non-basic table.

For a non-base table, after the coefficients of the non-base table corresponding to each batch are counted, N coefficients are obtained, and the maximum coefficient and the minimum coefficient are determined from the N coefficients.
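Steps 102 and 103 can be sketched as follows. The function name `batch_coefficients` and the monthly figures are hypothetical and chosen only for illustration; rejecting a batch with an empty base table is an assumption, since the original text does not say how such a batch should be handled.

```python
def batch_coefficients(non_base_counts, base_counts):
    """Return the per-batch coefficient: non-base records / base records."""
    if len(non_base_counts) != len(base_counts):
        raise ValueError("both lists must cover the same batches")
    coefficients = []
    for non_base, base in zip(non_base_counts, base_counts):
        if base <= 0:
            # Assumption: a batch with an empty base table cannot yield a ratio.
            raise ValueError("base table data volume must be positive")
        coefficients.append(non_base / base)
    return coefficients


# Hypothetical monthly data volumes for months 1 to 6 (not from the source).
registration = [1000, 1200, 1100, 900, 1000, 1250]
drug_infusion = [500, 660, 550, 405, 520, 600]

coeffs = batch_coefficients(drug_infusion, registration)
max_coefficient = max(coeffs)  # step 103: largest of the N coefficients
min_coefficient = min(coeffs)  # step 103: smallest of the N coefficients
```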

Step 104, calculating the predicted maximum data volume and the predicted minimum data volume of the non-base table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient.

In this step, to calculate the predicted maximum and minimum data volumes for the (N+1)th batch, the user increment of the (N+1)th batch relative to the Nth batch must first be calculated: the data volume of the base table for the (N+1)th batch is collected, and the data volume of the base table for the Nth batch is subtracted from it. Because the base table records users' basic data and each record corresponds to a unique user identifier, the number of records in the base table also represents the number of users, so this difference is the user increment.

After the user increment is obtained, the predicted maximum data volume and the predicted minimum data volume for the (N+1)th batch are calculated:

predicted maximum data volume = data volume of the non-base table in the Nth batch + user increment × maximum coefficient;

predicted minimum data volume = data volume of the non-base table in the Nth batch + user increment × minimum coefficient.

Step 105, detecting whether the data volume of the non-base table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.

In this embodiment, preferably, if the actual data volume of the non-base table in the (N+1)th batch is greater than or equal to the predicted minimum data volume and less than or equal to the predicted maximum data volume, the data volume of the non-base table in the (N+1)th batch is determined to be normal; otherwise it is determined to be abnormal.
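Steps 104 and 105 can be sketched together as follows. The names `Bounds`, `predict_bounds` and `is_abnormal` are hypothetical. The prediction follows the form used in the worked example of this description (data volume of the Nth batch plus user increment times coefficient). Ordering the two predicted values so that the range stays valid when the user increment is negative is an assumption added here, not something the original text specifies.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    minimum: float
    maximum: float


def predict_bounds(base_counts, non_base_counts, next_base_count):
    """Predict the normal range of a non-base table for batch N+1.

    base_counts / non_base_counts hold the N sample batches in time order;
    next_base_count is the base table data volume collected for batch N+1.
    """
    coefficients = [nb / b for nb, b in zip(non_base_counts, base_counts)]
    user_increment = next_base_count - base_counts[-1]
    last_non_base = non_base_counts[-1]
    candidates = (
        last_non_base + user_increment * min(coefficients),
        last_non_base + user_increment * max(coefficients),
    )
    # Assumption (not in the source): sort the two values so the range is
    # still well-formed when the user increment is negative.
    return Bounds(min(candidates), max(candidates))


def is_abnormal(actual_non_base_count, bounds):
    """Batch N+1 is abnormal when its volume falls outside the range."""
    return not (bounds.minimum <= actual_non_base_count <= bounds.maximum)
```

A caller would pass the N sample batches and the newly collected base table volume to `predict_bounds`, then pass the actual non-base table volume of the new batch to `is_abnormal`.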

Therefore, in this detection approach, detection of the data volume of the current batch depends on the data volumes of the preceding N batches. The detection process does not depend on human experience, requires no vendor-provided indices, and involves no complex artificial intelligence algorithm, thereby freeing up labor, improving quality control precision and the degree of automation, and lowering technical requirements and technical investment.

In order to reduce the influence of abnormal data on the subsequent data detection, the data which has been detected as abnormal is not contained in the collected sample data of N batches at the next detection.

The detection method of the present application may be performed periodically or aperiodically.

The above-described data detection process of the present application is described below by way of a specific embodiment.

Take detection of the outpatient medical order table of a hospital as an example. Assuming N is 5 and T is 1 month, whether the data volume of the medical order table for the current month, November, is abnormal is detected as follows:

1. Collecting sample data

The registration table data volumes of the 5 batches from June to October are collected and denoted Bg6, Bg7, Bg8, Bg9, Bg10; the medical order table (outpatient type) data volumes of the same 5 batches are collected and denoted By6, By7, By8, By9, By10.

2. Calculating coefficients

The coefficients X6 to X10 of the medical order table for the 5 batches are calculated:

X6 = By6 / Bg6;

X7 = By7 / Bg7;

X8 = By8 / Bg8;

X9 = By9 / Bg9;

X10 = By10 / Bg10.

Assume that X7 is the maximum coefficient and X10 is the minimum coefficient.

3. Calculating the predicted maximum data volume and the predicted minimum data volume of the medical order table for November

User increment W = Bg11 - Bg10, where Bg11 is the data volume of the registration table generated in November;

Predicted maximum data volume By11_max = By10 + W × X7;

Predicted minimum data volume By11_min = By10 + W × X10.

4. Determining whether the data volume By11 of the medical order table for November is abnormal

If By11_min ≤ By11 ≤ By11_max, By11 is determined to be normal; otherwise, By11 is determined to be abnormal.
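The four steps above can be traced with concrete numbers. All figures below are hypothetical and chosen only so that X7 is the largest and X10 the smallest coefficient, matching the assumption in the example; they are not taken from the original text.

```python
# Hypothetical data volumes for June to October (batches 6-10).
Bg = {6: 2000, 7: 2500, 8: 2400, 9: 2200, 10: 2000}  # registration table
By = {6: 3000, 7: 4000, 8: 3600, 9: 3190, 10: 2800}  # order table (outpatient)

# Step 2: one coefficient per batch, then the extremes.
X = {month: By[month] / Bg[month] for month in Bg}
x_max = max(X.values())  # X7 with these numbers
x_min = min(X.values())  # X10 with these numbers

# Step 3: predicted range for November (batch 11).
Bg11 = 2300              # hypothetical registration volume for November
W = Bg11 - Bg[10]        # user increment
By11_max = By[10] + W * x_max
By11_min = By[10] + W * x_min

# Step 4: compare the actual November volume with the range.
By11 = 3250              # hypothetical actual order table volume
is_normal = By11_min <= By11 <= By11_max
```

With these made-up figures the coefficients are 1.5, 1.6, 1.5, 1.45 and 1.4, the user increment is 300, and the predicted range runs from about 3220 to about 3280, so a November volume of 3250 is judged normal.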

5. Automatic optimization

When the next detection period arrives, for example when detecting whether the medical order table for December is abnormal, if By11 has already been detected as abnormal, the November data is not included when sample data is collected for the December detection; the registration table data volumes of the 5 batches from June to October and the medical order table (outpatient type) data volumes of the 5 batches from June to October are still collected.
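The exclusion of abnormal batches from the sample can be sketched as follows. The function `select_sample_batches` and the tuple layout of `history` are hypothetical; the figures are invented for illustration.

```python
def select_sample_batches(history, abnormal_batches, n):
    """Return the latest n batches from history, skipping abnormal ones.

    history: list of (batch_label, base_count, non_base_count) in time order.
    abnormal_batches: set of batch labels previously detected as abnormal.
    """
    clean = [row for row in history if row[0] not in abnormal_batches]
    if len(clean) < n:
        raise ValueError("not enough normal batches to form N samples")
    return clean[-n:]


# Hypothetical history for May to November; November was detected as abnormal.
history = [
    (5, 1900, 2900), (6, 2000, 3000), (7, 2500, 4000), (8, 2400, 3600),
    (9, 2200, 3190), (10, 2000, 2800), (11, 2300, 5000),
]
samples = select_sample_batches(history, abnormal_batches={11}, n=5)
sample_labels = [row[0] for row in samples]
```

Here the sample used for the December detection is again batches 6 to 10, as in the example above.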

In addition, if the detection result is incorrect, for example the actual data volume is normal but is detected as abnormal, or the actual data volume is abnormal but is detected as normal, the detection method needs to be optimized, for example by adjusting the value of N, as shown in fig. 2, including:

Step 201, collecting sample data through a time window, where the initial length of the time window is M batches, the duration of each batch is T, the start position of the time window is the batch immediately preceding the batch to be detected, and the end position of the time window is the Mth batch before the batch to be detected; each time sample data is collected, the start position of the time window is unchanged and the end position is moved P batches earlier than in the previous collection; sample data is collected through the time window a predetermined number of times.

Step 202, calculating the error percentage corresponding to the sample data of each collection, and taking the number of batches of the sample data whose error percentage has the smallest absolute value as the value of N.

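Calculating the error percentage corresponding to the sample data of each collection comprises the following steps:

calculating the coefficient of the non-base table to be detected for each batch in the sample data of that collection, and computing the average coefficient of the non-base table;

calculating the predicted average data volume of the non-base table to be detected for the batch to be detected as: the data volume of the non-base table to be detected in the previous batch + the user increment × the average coefficient;

calculating the error percentage of the non-base table to be detected for that sample data as: the ratio of the predicted average data volume to the data volume of the non-base table in the batch to be detected, minus 1.

Here, in addition to predicting the data volume using the average coefficient, the data volume may also be predicted using, for example, a linear regression prediction value or a compound growth rate estimate.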

When the value of N is adjusted, sample data may be collected by means of a time window whose initial length is M batches. If M is 3, the 1st collection gathers sample data of 3 consecutive batches. The start position of the time window is the batch immediately preceding the batch to be detected, and the end position is the Mth batch before the batch to be detected: if the batch to be detected is the 12th batch, the start position is the 11th batch and the end position is the 9th batch, so the 1st collection gathers the 3 consecutive batches 11, 10 and 9. On each subsequent collection the start position is unchanged (still the 11th batch) and the end position moves P batches earlier than in the previous collection. If P is 1, the end position for the 2nd collection moves 1 batch earlier than that of the 1st collection (the 9th batch), to the 8th batch, so the 2nd collection gathers the 4 consecutive batches 11, 10, 9 and 8; the 3rd collection gathers the 5 consecutive batches 11, 10, 9, 8 and 7; and so on until the predetermined number of collections has been made.

For the sample data of the 3 batches gathered in the 1st collection (batches 11, 10 and 9), the coefficients of the non-base table for the 3 batches are calculated, the average coefficient of the non-base table (the mean of the 3 coefficients) is calculated, and the predicted average data volume of the non-base table for the current batch (the 12th batch) is calculated as: data volume of the non-base table in the previous batch (the 11th batch) + user increment × average coefficient. Then the error percentage of the non-base table for the current batch (the 12th batch) is calculated as: (predicted average data volume / data volume of the non-base table in the current batch) - 1.

By analogy, the error percentages corresponding to the sample data of the preceding 4 batches (batches 11, 10, 9 and 8) and the preceding 5 batches (batches 11, 10, 9, 8 and 7) are calculated. The absolute values of the three error percentages are then compared, and the number of batches in the time window whose error percentage has the smallest absolute value is taken as the value of N. For example, if the error percentage for the preceding 4 batches has the smallest absolute value, N = 4 is used for subsequent data detection.
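The window-based adjustment of N can be sketched as follows. The function `choose_n` and its parameter names are hypothetical, and stopping early when the history is too short for a window is an assumption; the original text does not describe that case.

```python
def choose_n(base_counts, non_base_counts, m=3, p=1, rounds=3):
    """Pick N as the window length with the smallest absolute error percentage.

    base_counts / non_base_counts are in time order and end with the batch
    to be detected, whose actual data volumes are already known.
    Returns (best_n, best_error); best_n is None if no window fits.
    """
    target_non_base = non_base_counts[-1]
    prev_non_base = non_base_counts[-2]
    user_increment = base_counts[-1] - base_counts[-2]

    best_n, best_error = None, None
    for k in range(rounds):
        length = m + k * p
        if length > len(base_counts) - 1:
            break  # assumption: stop when there is not enough history
        # The window covers the `length` batches just before the target batch.
        window_base = base_counts[-1 - length:-1]
        window_non_base = non_base_counts[-1 - length:-1]
        coeffs = [nb / b for nb, b in zip(window_non_base, window_base)]
        average = sum(coeffs) / len(coeffs)
        predicted = prev_non_base + user_increment * average
        error = predicted / target_non_base - 1
        if best_error is None or abs(error) < abs(best_error):
            best_n, best_error = length, error
    return best_n, best_error


# Hypothetical volumes for batches 7 to 12; batch 12 is the one to be detected.
base = [100, 100, 100, 100, 100, 110]
non_base = [300, 100, 160, 150, 140, 154]
n, err = choose_n(base, non_base, m=3, p=1, rounds=3)
```

With these invented figures the windows of length 3, 4 and 5 predict about 155, 153.75 and 157 against an actual value of 154, so the window of 4 batches is selected.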

It should be noted that:

1. the optimization operation is not easy to frequently and can be performed once every fixed time period (for example, half a year or one year), and when the optimization is performed, the current batch is the batch to be detected. Because frequent optimizations increase the risk of "overfitting";

2. The upper and lower limits of the time window length need to be restricted. For example, with a lower limit of 3 and an upper limit of 12, at least 3 and at most 12 consecutive batches are collected. This avoids unstable detection caused by too few samples and low detection sensitivity caused by too many samples.

In order to implement the foregoing method for detecting anomalies in data volume, an embodiment of the present application further provides a device for detecting anomalies in data volume. The device is applied to a data system that includes at least one service type, each service type having a base table and at least one non-base table. As shown in fig. 3, the device includes:

the collecting module 10 is configured to collect sample data of N batches for a service type, where a duration of each batch is T, and the sample data of each batch includes a data amount of a base table and data amounts of all non-base tables;

a calculation module 20, configured to compute, for any non-base table, the coefficient of the non-base table corresponding to each batch according to the data amount of the non-base table and the data amount of the base table, and to calculate the maximum coefficient and the minimum coefficient corresponding to the non-base table according to the coefficients of the N batches corresponding to the non-base table;

a prediction module 30, configured to calculate a predicted maximum data amount and a predicted minimum data amount of the non-base table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient;

a detecting module 40, configured to detect whether the data amount of the non-base table in the (N+1)th batch is abnormal according to the predicted maximum data amount and the predicted minimum data amount.

The calculating module 20 is further configured to, when counting the coefficient of the non-base table corresponding to each batch according to the data amount of the non-base table and the data amount of the base table, calculate, for any batch, the coefficient of the non-base table corresponding to the batch as follows: the ratio of the amount of data in the batch for the non-base table to the amount of data in the batch for the base table.
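As an illustration of what the calculation module computes, the following Python sketch derives the per-batch coefficients and their maximum and minimum. The function name and the sample record counts are hypothetical; the guard against a non-positive base-table count is an added assumption, consistent with each table containing at least one record.

```python
def batch_coefficients(non_base_counts, base_counts):
    """Coefficient per batch: non-base data amount divided by base data amount."""
    coefficients = []
    for non_base, base in zip(non_base_counts, base_counts):
        if base <= 0:
            # Assumption: a base table always holds at least one record.
            raise ValueError("base table data amount must be positive")
        coefficients.append(non_base / base)
    return coefficients


# Hypothetical record counts for N = 3 batches.
base_counts = [100, 110, 120]
non_base_counts = [210, 220, 252]

coeffs = batch_coefficients(non_base_counts, base_counts)
max_coeff, min_coeff = max(coeffs), min(coeffs)
print(max_coeff, min_coeff)
```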

When calculating the predicted maximum data amount and the predicted minimum data amount of the non-base table for the (N+1)th batch, the prediction module 30 is further configured to: collect the data amount of the base table in the (N+1)th batch; subtract the data amount of the base table in the Nth batch from the data amount of the base table in the (N+1)th batch to obtain a user increment; calculate the predicted maximum data amount of the non-base table in the (N+1)th batch as: the data amount of the non-base table in the Nth batch + the user increment × the maximum coefficient; and calculate the predicted minimum data amount of the non-base table in the (N+1)th batch as: the data amount of the non-base table in the Nth batch + the user increment × the minimum coefficient.
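The prediction step can be sketched as below. The function name, parameter names and figures are hypothetical. The sketch assumes a non-negative user increment; with a negative increment the two bounds would swap order, a case the text does not address.

```python
def predicted_bounds(non_base_n, base_n, base_next, max_coeff, min_coeff):
    """Predicted (minimum, maximum) non-base data amount for batch N+1."""
    user_increment = base_next - base_n   # base(N+1) - base(N)
    predicted_max = non_base_n + user_increment * max_coeff
    predicted_min = non_base_n + user_increment * min_coeff
    return predicted_min, predicted_max


# Hypothetical figures: 10 new base-table records, coefficients between 2.0 and 2.1.
low, high = predicted_bounds(
    non_base_n=252, base_n=120, base_next=130, max_coeff=2.1, min_coeff=2.0
)
print(low, high)
```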

When detecting whether the data amount of the non-base table in the (N+1)th batch is abnormal according to the predicted maximum data amount and the predicted minimum data amount, the detecting module 40 is further configured to: determine that the data amount of the non-base table in the (N+1)th batch is normal when it is greater than or equal to the predicted minimum data amount and less than or equal to the predicted maximum data amount; and otherwise determine that the data amount of the non-base table in the (N+1)th batch is abnormal.
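The detection rule amounts to an inclusive range check, as the following minimal sketch shows. The function name `is_normal` and the numbers are hypothetical.

```python
def is_normal(actual, predicted_min, predicted_max):
    """True when the observed data amount lies within the predicted range, bounds included."""
    return predicted_min <= actual <= predicted_max


# Hypothetical predicted range of 272 to 273 records for batch N+1.
print(is_normal(272, 272, 273))   # on the lower bound: normal
print(is_normal(280, 272, 273))   # above the upper bound: abnormal
```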

The device further comprises an optimization module 50, configured to adjust the value of N if, for the batch to be detected, the detection result is wrong;

the collecting module 10 is further configured to collect sample data through a time window, where the initial length of the time window is M batches, the starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position of the time window is the M-th batch preceding the batch to be detected; each time sample data are collected, the starting position of the time window is unchanged, and the ending position of the time window is moved forward by P batches compared with the previous collection; the number of times sample data are collected using the time window is a preset number;

the optimizing module 50 is further configured to calculate the error percentage corresponding to the sample data collected each time through the time window, and to take the number of batches of the sample data corresponding to the error percentage with the smallest absolute value as the value of N.

Wherein, when calculating the error percentage corresponding to the sample data collected each time:

the calculating module 20 is further configured to calculate, for the non-base table to be detected, the coefficient corresponding to each batch in the sample data collected this time, and to calculate the average coefficient corresponding to the non-base table;

the prediction module 30 is further configured to calculate the predicted average data amount of the non-base table to be detected for the batch to be detected as: the data amount of the non-base table to be detected in the previous batch + the user increment × the average coefficient;

the optimization module 50 is further configured to calculate the error percentage of the non-base table to be detected corresponding to the sample data collected this time as: the ratio of the predicted average data amount to the data amount of the non-base table in the batch to be detected, minus 1.

In addition, an embodiment of the present application further provides an apparatus, including:

one or more processors;

a storage means for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the anomaly detection method for the data volume described above.

Another embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described abnormality detection method for a data amount.

Program code for performing the operations of embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.

The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be considered as essential to the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.

The block diagrams of the devices, apparatuses, equipment and systems referred to in the present application are only illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment and systems may be connected, arranged and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."

It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalent aspects of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims

1. An anomaly detection method for medical data volume is characterized in that the method is applied to a data system, the data system comprises at least one service type, each service type comprises a basic table and at least one non-basic table, each table comprises at least one record, and the data volume is the record number contained in the table; the basic table is used for recording basic data of users, and each record corresponds to a unique user identifier; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user; the method comprises the following steps:

for any one of the non-base tables, according to the data amount of the non-base table and the data amount of the base table, the coefficient of the non-base table corresponding to each batch is counted, and for any one batch, the coefficient of the non-base table corresponding to the batch is: the ratio of the amount of data of the non-base table in the batch to the amount of data of the base table in the batch;

detecting whether the data amount of the non-basic table in the (n+1) th batch is abnormal or not according to the predicted maximum data amount and the predicted minimum data amount;

the calculating the predicted maximum data amount and the predicted minimum data amount of the non-basic table corresponding to the (n+1) th batch comprises the following steps:

collecting the data volume of a basic table of the (n+1) th batch;

the predicted maximum data amount of the non-base table in the (N+1)th batch is: the data amount of the non-base table in the Nth batch + the user increment × the maximum coefficient;

the predicted minimum data amount of the non-base table in the (N+1)th batch is: the data amount of the non-base table in the Nth batch + the user increment × the minimum coefficient.

2. The method of claim 1, wherein detecting whether the data volume of the non-base table at the n+1st lot is abnormal based on the predicted maximum data volume and the predicted minimum data volume comprises:

3. The method of claim 2, wherein the collected N batches of sample data do not include data that has been detected as anomalous.

4. The method of claim 1, further comprising, for a lot to be inspected, adjusting the value of N if the inspection result is incorrect, comprising:

5. The method of claim 4, wherein calculating the percentage error for each sample data acquired through the time window comprises:

calculating the predicted average data amount of the non-basic table to be detected corresponding to the batch to be detected as follows: the data amount of the non-basic table to be detected in the previous batch + the user increment × the average coefficient;

6. An abnormality detection device for medical data volume, characterized in that the device is applied to a data system, the data system comprises at least one service type, each service type has a basic table and at least one non-basic table, each table contains at least one record, and the data volume is the record number contained in the table; the basic table is used for recording basic data of users, and each record corresponds to a unique user identifier; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user; the device comprises:

the calculation module is used for counting the coefficient of the non-basic table corresponding to each batch according to the data quantity of the non-basic table and the data quantity of the basic table, and the coefficient of the non-basic table corresponding to the batch is as follows for any batch: the ratio of the amount of data of the non-base table in the batch to the amount of data of the base table in the batch; according to the coefficients of N batches corresponding to the non-basic table, calculating the maximum coefficient and the minimum coefficient corresponding to the non-basic table;

the prediction module is configured to calculate, according to the maximum coefficient and the minimum coefficient, a predicted maximum data amount and a predicted minimum data amount of the non-basic table corresponding to the (N+1)th batch, which comprises: collecting the data amount of the basic table in the (N+1)th batch; subtracting the data amount of the basic table in the Nth batch from the data amount of the basic table in the (N+1)th batch to obtain a user increment; the predicted maximum data amount of the non-basic table in the (N+1)th batch is: the data amount of the non-basic table in the Nth batch + the user increment × the maximum coefficient; the predicted minimum data amount of the non-basic table in the (N+1)th batch is: the data amount of the non-basic table in the Nth batch + the user increment × the minimum coefficient;

7. An apparatus, the apparatus comprising:

one or more processors;

a storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-5.

8. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.