CN112635031B - Data volume anomaly detection method, device, storage medium and equipment - Google Patents

Data volume anomaly detection method, device, storage medium and equipment

Info

Publication number
CN112635031B
Authority
CN
China
Prior art keywords
data
batch
basic
data volume
coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011478233.4A
Other languages
Chinese (zh)
Other versions
CN112635031A (en
Inventor
许朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyiyun Technology Co ltd
Original Assignee
Beijing Yiyiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyiyun Technology Co ltd filed Critical Beijing Yiyiyun Technology Co ltd
Priority to CN202011478233.4A priority Critical patent/CN112635031B/en
Publication of CN112635031A publication Critical patent/CN112635031A/en
Application granted granted Critical
Publication of CN112635031B publication Critical patent/CN112635031B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/20ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • General Business, Economics & Management (AREA)
  • Business, Economics & Management (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Automatic Analysis And Handling Materials Therefor (AREA)

Abstract

The application discloses a method for detecting anomalies in data volume. For a service type, sample data of N batches are collected, where the duration of each batch is T and the sample data of each batch comprise the data volume of a basic table and the data volumes of all non-basic tables; N is a positive integer and T is greater than zero. For any non-basic table, the coefficient of the non-basic table for each batch is computed from the data volume of the non-basic table and the data volume of the basic table; the maximum coefficient and the minimum coefficient of the non-basic table are determined from its coefficients for the N batches; the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch are calculated from the maximum coefficient and the minimum coefficient; and whether the data volume of the non-basic table in the (N+1)th batch is abnormal is detected according to the predicted maximum data volume and the predicted minimum data volume.

Description

Data volume anomaly detection method, device, storage medium and equipment
Technical Field
The present application relates to data processing technologies, and in particular, to a method, an apparatus, a storage medium, and a device for detecting anomalies in data volumes.
Background
In the field of medical data management, data are collected by a client deployed at the hospital (hereinafter "the hospital"), and the hospital uploads the data to a system that manages them. The volume of data uploaded by the hospital therefore needs to be monitored and checked for plausibility, to ensure that medical data are transmitted stably, without loss or duplication.
Traditional medical data quality control takes three main forms: quality control based on manual experience, prediction indices provided by hospital-side vendors, and approaches that rely entirely on artificial intelligence. However, quality control based on manual experience suffers from high cost, vague standards that cannot be quantified, and low precision; vendor-provided prediction indices suffer from low precision, among other problems; and relying entirely on artificial intelligence demands a high level of technical expertise and investment.
Therefore, a data quality control approach is needed that is simple, easy to maintain, low in cost and high in precision.
Disclosure of Invention
The application provides a method and device for detecting anomalies in data volume, which address at least the above technical problems in the prior art.
In one aspect, the present application provides a method for detecting anomalies in data volume. The method is applied to a data system comprising at least one service type, each service type having a basic table and at least one non-basic table, and the method comprises:
for a service type, collecting sample data of N batches, wherein the duration of each batch is T, and the sample data of each batch comprise the data volume of the basic table and the data volumes of all non-basic tables; N is a positive integer and T is greater than zero;
for any non-basic table, computing the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table;
determining the maximum coefficient and the minimum coefficient of the non-basic table from its coefficients for the N batches;
calculating, according to the maximum coefficient and the minimum coefficient, the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch;
and detecting, according to the predicted maximum data volume and the predicted minimum data volume, whether the data volume of the non-basic table in the (N+1)th batch is abnormal.
Each of the basic table and the non-basic tables contains at least one record, and the data volume of a table is the number of records it contains;
the basic table records users' basic data, and each record corresponds to a unique user identifier; the non-basic table records the associated data that a user generates after that user's basic data have been generated.
Computing the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table comprises:
for any batch, the coefficient of the non-basic table for that batch is the ratio of the data volume of the non-basic table in the batch to the data volume of the basic table in the batch.
Calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch comprises:
collecting the data volume of the basic table of the (N+1)th batch;
subtracting the data volume of the basic table of the Nth batch from the data volume of the basic table of the (N+1)th batch to obtain the user increment;
the predicted maximum data volume of the non-basic table for the (N+1)th batch is: the data volume of the non-basic table in the Nth batch + the user increment × the maximum coefficient;
the predicted minimum data volume of the non-basic table for the (N+1)th batch is: the data volume of the non-basic table in the Nth batch + the user increment × the minimum coefficient.
Detecting whether the data volume of the non-basic table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume comprises:
if the data volume of the non-basic table in the (N+1)th batch is greater than or equal to the predicted minimum data volume and less than or equal to the predicted maximum data volume, determining that the data volume of the non-basic table in the (N+1)th batch is normal; otherwise, determining that it is abnormal.
The collected sample data of the N batches do not contain data that have been detected as abnormal.
If a detection result is wrong, the method further comprises adjusting the value of N for the batch to be detected, including:
collecting sample data through a time window, wherein the initial length of the time window is M batches, the start position of the time window is the batch immediately preceding the batch to be detected, and the end position of the time window is the M-th batch preceding the batch to be detected; each time sample data are collected, the start position of the time window is unchanged and the end position is moved P batches earlier than in the previous collection; sample data are collected with the time window a predetermined number of times;
and calculating the error percentage for the sample data of each collection, and taking the number of batches in the sample data whose error percentage has the smallest absolute value as the value of N.
Calculating the error percentage for the sample data of each collection comprises:
for the sample data of any one collection, calculating the coefficient of the non-basic table to be detected for each batch in that sample data, and computing the average coefficient of the non-basic table;
calculating the predicted average data volume of the non-basic table to be detected for the batch to be detected as: the data volume of the non-basic table to be detected in the previous batch + the user increment × the average coefficient;
calculating the error percentage of the non-basic table to be detected for that sample data as: the ratio of the predicted average data volume to the data volume of the non-basic table in the batch to be detected, minus 1.
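The quantities defined in this aspect can be written compactly as follows. The symbols are introduced here for illustration only and do not appear in the original text: B_i and Y_i denote the data volumes of the basic table and of one non-basic table in batch i, c_i is the coefficient, S is the set of batches in one time window, and d is the batch to be detected. The reference batch in the last line is taken to be the batch immediately preceding d, which is an assumption made by analogy with the role of the Nth batch in the maximum and minimum predictions.

```latex
\begin{aligned}
c_i &= \frac{Y_i}{B_i}, \qquad i = 1,\dots,N \\
c_{\max} &= \max_{1 \le i \le N} c_i, \qquad c_{\min} = \min_{1 \le i \le N} c_i \\
w &= B_{N+1} - B_N \\
\hat{Y}^{\max}_{N+1} &= Y_N + w\,c_{\max}, \qquad \hat{Y}^{\min}_{N+1} = Y_N + w\,c_{\min} \\
Y_{N+1}\ \text{normal} &\iff \hat{Y}^{\min}_{N+1} \le Y_{N+1} \le \hat{Y}^{\max}_{N+1} \\
\bar{c} &= \frac{1}{|S|}\sum_{i \in S} c_i, \qquad
\hat{Y}^{\mathrm{avg}}_{d} = Y_{d-1} + (B_d - B_{d-1})\,\bar{c}, \qquad
e = \frac{\hat{Y}^{\mathrm{avg}}_{d}}{Y_d} - 1
\end{aligned}
```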
In another aspect, the present application provides a device for detecting anomalies in data volume. The device is applied to a data system comprising at least one service type, each service type having a basic table and at least one non-basic table, and the device comprises:
a collection module, used for collecting sample data of N batches for a service type, wherein the duration of each batch is T, and the sample data of each batch comprise the data volume of the basic table and the data volumes of all non-basic tables;
a calculation module, used for computing, for any non-basic table, the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table, and for determining the maximum coefficient and the minimum coefficient of the non-basic table from its coefficients for the N batches;
a prediction module, used for calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient;
and a detection module, used for detecting whether the data volume of the non-basic table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.
In yet another aspect, the present application provides an apparatus comprising:
one or more processors;
a storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the anomaly detection method for the data volume described in any one of the above.
Still another aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described anomaly detection methods for data volume.
In the anomaly detection method described above, detection of the data volume of the current batch depends on the data volumes of the previous N batches: indices computed from the data volumes of those N batches, namely the predicted maximum data volume and the predicted minimum data volume, are used to detect whether the data volume of the current batch is abnormal. The detection process does not rely on human experience, requires no vendor-provided indices and involves no complex artificial intelligence algorithm, thereby reducing labor cost, improving quality control precision and the degree of automation, and lowering technical requirements and technical investment cost.
Drawings
FIG. 1 is a flow chart of a method for detecting anomalies in data amounts according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for detecting anomalies in data amounts according to another embodiment of the present application;
FIG. 3 is a schematic diagram showing a structure of an anomaly detection device for data volume according to an embodiment of the present application;
Detailed Description
In order to make the objects, features and advantages of the present application more comprehensible, the technical solutions according to the embodiments of the present application will be clearly described in the following with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the field of medical data management, existing data quality control approaches roughly include:
Quality control based on manual experience: drawing on experience gained from long-term tracking, quality control personnel perform targeted quality control on each data table of each medical institution, based on the size of the data volume increment, trend changes, characteristics of the medical institution (for example, the data volume growth rates of the tables of a general hospital are relatively similar, whereas individual tables of a specialized hospital grow markedly differently from its other tables) and other special data phenomena.
Prediction indices provided by hospital-side vendors: the hospital-side vendor provides a prediction index for the data volume of each table; quality control personnel compare the actual data volume with the index and mark the table as abnormal if the difference exceeds a threshold.
Relying entirely on artificial intelligence: the data volume is predicted accurately with deep learning techniques such as neural networks, which allows fully automatic quality control.
However, all three approaches have problems. For example:
Quality control based on manual experience has the following problems:
1. High quality control cost. Quality control personnel are limited; performing quality control on all key tables of many medical institutions requires more time or more personnel. Empirically, incremental quality control for 20 medical institutions takes 2-3 days, and the personnel cost rises further if the data are updated frequently. In addition, manual quality control depends heavily on the data of the previous batch; if the previous batch is faulty, the current round of quality control loses its reference point;
2. Vague, unquantifiable standards. The dividing line between normal and abnormal data growth is left entirely to the judgment of the quality control personnel, so the result depends heavily on their experience, and the standard changes whenever the personnel change, which leads to inconsistent quality control standards;
3. Low quality control precision. Experience-based quality control is sensitive to the interval between data updates. If a batch does not cover a complete period (for example, updates are scheduled every 7 days but in practice often arrive early or late because of holidays, technical problems and so on), it becomes much harder for the personnel to apply their experience; and if the batch includes holidays, disease outbreaks or similar factors, the data volume is strongly affected and the quality control precision suffers.
Having the hospital-side vendor provide prediction indices has the following problems:
1. The index provided by the vendor is often opaque, or its underlying principle is not disclosed, so it cannot be corrected when errors occur;
2. The vendor faces the same problems when determining the index: whether the index is determined manually or by artificial intelligence, the same cost and precision issues arise, so the indices that vendors provide are generally inaccurate and cannot be optimized.
Relying entirely on artificial intelligence has the following problems:
1. High technical requirements. Medical data are not updated in batches at equal intervals, so this is not a simple time-series prediction problem; many influencing factors such as holidays and diseases must be considered, and these factors are difficult to obtain and use.
2. High investment cost. The object of quality control is each table of each medical institution, which means that each table of each medical institution needs its own model, together with later maintenance and updates; development and maintenance are therefore very costly.
To this end, an embodiment of the present application provides a method for detecting anomalies in data volume, which addresses at least the problems encountered in medical data quality control described above, reduces labor cost, improves quality control precision and the degree of automation, lowers technical requirements and technical investment cost, and improves the portability of the solution. The method is suitable not only for quality control of medical data but also for traffic data and other data with characteristics similar to medical data.
Fig. 1 shows a method for detecting anomalies in data volume according to an embodiment of the present application, which is applied to a data system including at least one service type, each service type having a basic table and at least one non-basic table. The tables under each service type are detected independently, and the same detection method is used for the tables under every service type. The method comprises the following steps:
Step 101, collecting sample data of N batches, wherein the duration of each batch is T, and the sample data of each batch comprise the data volume of the basic table and the data volume of at least one non-basic table; N is a positive integer and T is greater than zero.
Each of the basic table and the non-basic tables contains at least one record, and in this embodiment the data volume of a table is the number of records it contains. The basic table records users' basic data. Taking medical data as an example, if the service type is outpatient, the basic table may be a registration table: each record in the registration table holds the registration information of one patient, i.e. the basic data (including outpatient number, name, age, identity card number, department and so on), and each record corresponds to a unique user identifier, such as the outpatient number. After a user's basic data have been generated, the non-basic tables record the associated data that the user generates; for example, after a patient registers, medical order information, drug infusion information and so on may be generated and recorded in the corresponding non-basic tables, and each record in a non-basic table is likewise associated with a unique user identifier, i.e. the outpatient number.
It should be noted that the same user may have different user identifiers under different service types. For example, when the service type is inpatient, the patient is assigned a hospitalization number that differs from the outpatient number. The outpatient number identifies the data that the user generates under the outpatient service type, and the hospitalization number identifies the data that the user generates under the inpatient service type.
In addition, some tables in the data system record data of several service types at once. For example, both outpatient and inpatient services generate medical order information, so the medical order table contains both outpatient and inpatient medical order records. When sample data of one service type are collected in this step, the records that match that service type must be filtered out of such a table to form the sample data of the table under that service type. For example, when outpatient sample data are collected, the outpatient-type records in the medical order table are selected to form the outpatient medical order table used in subsequent detection.
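The filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (`OrderRecord` with `user_id`, `service_type` and `batch` fields), the function name `count_by_batch` and the sample values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderRecord:
    # Hypothetical record layout; the field names are illustrative only.
    user_id: str        # e.g. outpatient number or hospitalization number
    service_type: str   # e.g. "outpatient" or "inpatient"
    batch: str          # batch label, e.g. "2020-06" when T is one month


def count_by_batch(records, service_type):
    """Return {batch: record count} for the records of one service type."""
    return dict(Counter(r.batch for r in records if r.service_type == service_type))


# A shared medical order table holding records of two service types.
records = [
    OrderRecord("O-001", "outpatient", "2020-06"),
    OrderRecord("H-001", "inpatient", "2020-06"),
    OrderRecord("O-002", "outpatient", "2020-06"),
    OrderRecord("O-003", "outpatient", "2020-07"),
]
outpatient_counts = count_by_batch(records, "outpatient")
```

The resulting per-batch counts are the data volumes of the outpatient medical order table that the later steps operate on.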
In this embodiment, N batches of data may be collected at a time, where each batch covers a duration T. For example, if T is one month and N is 6, six months of data are collected. The sample data include the data volume of one basic table and the data volume of at least one non-basic table; if the detection target is a particular non-basic table, only the data volumes of the basic table and of that non-basic table need to be collected. For example, the monthly data volume of the registration table and the monthly data volume of the drug infusion table are collected for January to June.
Step 102, for any non-basic table, computing the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table.
For any batch, the coefficient of the non-basic table for that batch is the ratio of the data volume of the non-basic table in the batch to the data volume of the basic table in the batch.
For example, for the data volumes collected for January to June, the coefficient of the drug infusion table for the first batch (January) is: data volume of the drug infusion table in January / data volume of the registration table in January; its coefficient for the second batch (February) is: data volume of the drug infusion table in February / data volume of the registration table in February; and so on, giving the coefficients of the drug infusion table for all 6 batches.
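A minimal sketch of the per-batch coefficient calculation follows. The function name `batch_coefficients` and the monthly counts are made up for illustration, and rejecting a batch with an empty basic table is an assumption, since the text does not cover that case.

```python
def batch_coefficients(non_basic_counts, basic_counts):
    """Coefficient per batch = non-basic table count / basic table count."""
    if len(non_basic_counts) != len(basic_counts):
        raise ValueError("both series must cover the same batches")
    coefficients = []
    for non_basic, basic in zip(non_basic_counts, basic_counts):
        if basic <= 0:
            # Assumption: a batch with an empty basic table has no defined coefficient.
            raise ValueError("basic table data volume must be positive")
        coefficients.append(non_basic / basic)
    return coefficients


# Hypothetical monthly counts for January to June (not taken from the source).
registration = [1000, 1200, 1100, 1300, 1250, 1400]
infusion = [500, 660, 539, 585, 650, 700]
coeffs = batch_coefficients(infusion, registration)
```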
Step 103, according to the coefficients of the N batches corresponding to the non-basic table, calculating the maximum coefficient and the minimum coefficient corresponding to the non-basic table.
For a non-base table, after the coefficients of the non-base table corresponding to each batch are counted, N coefficients are obtained, and the maximum coefficient and the minimum coefficient are determined from the N coefficients.
Step 104, calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient.
In this step, to calculate the predicted maximum data volume and the predicted minimum data volume for the (N+1)th batch, the user increment of the (N+1)th batch relative to the Nth batch must first be calculated: the data volume of the basic table of the (N+1)th batch is collected, and the data volume of the basic table of the Nth batch is subtracted from it to obtain the user increment. Because the basic table records users' basic data and each record corresponds to a unique user identifier, the number of records in the basic table also represents the number of users; the user increment is therefore the data volume of the basic table of the (N+1)th batch minus the data volume of the basic table of the Nth batch.
After the user increment is obtained, the predicted maximum data volume and the predicted minimum data volume for the (N+1)th batch are calculated:
the predicted maximum data volume of the non-basic table for the (N+1)th batch is: the data volume of the non-basic table in the Nth batch + the user increment × the maximum coefficient;
the predicted minimum data volume of the non-basic table for the (N+1)th batch is: the data volume of the non-basic table in the Nth batch + the user increment × the minimum coefficient.
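The two prediction formulas can be sketched as below. The function name `predict_bounds`, its parameter names and the sample numbers are illustrative assumptions; only the arithmetic follows the text.

```python
def predict_bounds(coefficients, non_basic_last, basic_last, basic_next):
    """Predicted (minimum, maximum) data volume of the non-basic table for batch N+1.

    coefficients   -- per-batch coefficients for batches 1..N
    non_basic_last -- data volume of the non-basic table in batch N
    basic_last     -- data volume of the basic table in batch N
    basic_next     -- data volume of the basic table in batch N+1
    """
    user_increment = basic_next - basic_last
    predicted_max = non_basic_last + user_increment * max(coefficients)
    predicted_min = non_basic_last + user_increment * min(coefficients)
    return predicted_min, predicted_max


# Hypothetical values: four sample batches, 100 more basic-table records in batch N+1.
low, high = predict_bounds([0.50, 0.55, 0.49, 0.45],
                           non_basic_last=585, basic_last=1300, basic_next=1400)
```

With these made-up inputs the bounds come out at roughly 630 and 640, i.e. 585 plus 100 times the smallest and the largest coefficient.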
Step 105, detecting whether the data volume of the non-basic table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.
In this embodiment, preferably, if the actual data volume of the non-basic table in the (N+1)th batch is greater than or equal to the predicted minimum data volume and less than or equal to the predicted maximum data volume, the data volume of the non-basic table in the (N+1)th batch is determined to be normal; otherwise, it is determined to be abnormal.
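The range check itself is a one-line comparison, sketched here with hypothetical function names. One consequence of the formulas is worth noting: if the basic table shrinks from batch N to batch N+1, the user increment is negative and the value computed from the maximum coefficient falls below the value computed from the minimum coefficient. The text does not say how that case is handled; the second function below orders the two bounds first, which is purely an assumption.

```python
def is_normal(actual, predicted_min, predicted_max):
    """Inclusive range check: normal if predicted_min <= actual <= predicted_max."""
    return predicted_min <= actual <= predicted_max


def is_normal_order_safe(actual, bound_a, bound_b):
    # Assumption (not in the source): tolerate swapped bounds, which arise
    # when the user increment is negative.
    low, high = sorted((bound_a, bound_b))
    return low <= actual <= high
```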
In this detection approach, therefore, detection of the data volume of the current batch depends on the data volumes of the previous N batches. The detection process does not rely on human experience, requires no vendor-provided indices and involves no complex artificial intelligence algorithm, thereby reducing labor cost, improving quality control precision and the degree of automation, and lowering technical requirements and technical investment cost.
To reduce the influence of abnormal data on subsequent detection, data that have already been detected as abnormal are not included in the sample data of N batches collected for the next detection.
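One way to exclude flagged batches when assembling the sample is sketched below. The function name `select_sample_batches` and the batch labels are hypothetical, and taking the N most recent batches that were not flagged is an assumption about how the window is refilled.

```python
def select_sample_batches(history, abnormal, n):
    """Return the n most recent batches not flagged abnormal, oldest first.

    history  -- batch labels in chronological order, all before the batch to detect
    abnormal -- set of batch labels previously detected as abnormal
    """
    clean = [batch for batch in history if batch not in abnormal]
    if len(clean) < n:
        raise ValueError("not enough normal batches to form the sample")
    return clean[-n:]


# Months 6 to 11 are available; month 11 was previously detected as abnormal.
history = ["06", "07", "08", "09", "10", "11"]
sample = select_sample_batches(history, abnormal={"11"}, n=5)
```

Here the sample falls back to months 6 to 10; with no flagged batch it would be months 7 to 11.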
The detection method of the present application may be performed periodically or aperiodically.
The above-described data detection process of the present application is described below by way of a specific embodiment.
Take detecting the outpatient medical order table of a hospital as an example. Assuming N is 5 and T is 1 month, to detect whether the data volume of the medical order table for the current month, November, is abnormal:
1. Collecting sample data
The data volumes of the registration table for the 5 batches June to October are collected and denoted Bg6, Bg7, Bg8, Bg9, Bg10; the data volumes of the medical order table (outpatient type) for the 5 batches June to October are collected and denoted By6, By7, By8, By9, By10.
2. Calculating coefficients
The coefficients X6 to X10 of the medical order table for the 5 batches are calculated:
X6=By6/Bg6;
X7=By7/Bg7;
X8=By8/Bg8;
X9=By9/Bg9;
X10=By10/Bg10;
Suppose X7 is the maximum coefficient and X10 is the minimum coefficient.
3. Calculating the predicted maximum data volume and the predicted minimum data volume of the medical order table for November
User increment w = Bg11 - Bg10, where Bg11 is the data volume of the registration table generated in November;
Predicted maximum data volume: By11_max = By10 + w*X7;
Predicted minimum data volume: By11_min = By10 + w*X10.
4. Determining whether the data volume By11 of the medical order table for November is abnormal
If By11_min ≤ By11 ≤ By11_max, By11 is determined to be normal; otherwise, By11 is determined to be abnormal.
5. Automatic optimization
When the next detection period arrives, for example when detecting whether the medical order table for December is abnormal, if By11 has already been detected as abnormal, the November data are not included when the sample data are collected: the registration table data volumes for the 5 batches June to October and the medical order table (outpatient type) data volumes for the 5 batches June to October are still used.
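Steps 1 to 4 of this example can be traced with made-up numbers. None of the values below come from the source; they are chosen only so that X7 is the largest coefficient and X10 the smallest, as assumed above.

```latex
\begin{aligned}
(Bg_6,\dots,Bg_{10}) &= (1000,\ 1200,\ 1100,\ 1300,\ 1250) \\
(By_6,\dots,By_{10}) &= (2000,\ 2640,\ 2200,\ 2470,\ 2250) \\
(X_6,\dots,X_{10}) &= (2.0,\ 2.2,\ 2.0,\ 1.9,\ 1.8) \\
w &= Bg_{11} - Bg_{10} = 1350 - 1250 = 100 \\
By_{11}^{\max} &= By_{10} + w\,X_7 = 2250 + 100 \times 2.2 = 2470 \\
By_{11}^{\min} &= By_{10} + w\,X_{10} = 2250 + 100 \times 1.8 = 2430
\end{aligned}
```

Under these hypothetical values, a November medical order volume of 2450 would be judged normal, whereas 2600 would be judged abnormal.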
In addition, if the detection result is incorrect, for example, the actual data amount is normal, but the detection is abnormal, or the detection is abnormal, but the detection is normal, the detection method needs to be optimized, for example, the value of N is adjusted, as shown in fig. 2, including:
Step 201, collecting sample data through a time window, wherein the initial length of the time window is M batches, the duration of each batch is T, the starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position of the time window is the M-th batch before the batch to be detected; each time sample data is collected, the starting position of the time window is unchanged and the ending position is moved P batches earlier compared with the previous collection; sample data is collected with the time window a predetermined number of times.
Step 202, calculating the error percentage corresponding to the sample data of each collection, and taking the number of batches in the sample data whose error percentage has the smallest absolute value as the value of N.
Calculating the error percentage corresponding to the sample data of each collection comprises the following steps:
calculating the coefficient of the non-base table to be detected for each batch in the sample data of that collection, and computing the average coefficient of the non-base table;
calculating the predicted average data volume of the non-base table to be detected for the batch to be detected as: data volume of the non-base table to be detected in the previous batch + user increment × average coefficient;
calculating the error percentage of the non-base table to be detected for the sample data of that collection as: (predicted average data volume / data volume of the non-base table in the batch to be detected) - 1.
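Written as formulas, the three steps read as follows. The symbols are introduced here only for illustration and are not the patent's own notation: the sample of one collection contains k batches with coefficients X_1 to X_k, V_prev and V_cur are the data volumes of the non-base table to be detected in the previous batch and in the batch to be detected, and w is the user increment.

```latex
\bar{X} = \frac{1}{k}\sum_{i=1}^{k} X_i, \qquad
\hat{V} = V_{\text{prev}} + w\,\bar{X}, \qquad
E = \frac{\hat{V}}{V_{\text{cur}}} - 1
```

E is positive when the prediction overshoots the actual volume and negative when it undershoots; step 202 compares only the absolute value of E across collections.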
Here, besides the average coefficient, the data volume may also be predicted with, for example, a linear regression predicted value or a compound growth rate estimate.
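The text names these two alternatives without defining them. The sketch below uses the standard textbook forms, applied to the series of past data volumes of the non-base table; both that choice of input and the function names are assumptions made for illustration.

```python
from typing import Sequence


def linear_regression_forecast(volumes: Sequence[float]) -> float:
    """Fit volume = a + b * index by least squares and extrapolate one batch."""
    n = len(volumes)
    if n < 2:
        raise ValueError("need at least two batches")
    mean_x = (n - 1) / 2                      # mean of indices 0..n-1
    mean_y = sum(volumes) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(volumes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n              # value at the next index


def compound_growth_forecast(volumes: Sequence[float]) -> float:
    """Apply the compound per-batch growth rate once more to the last value."""
    n = len(volumes)
    if n < 2 or volumes[0] <= 0:
        raise ValueError("need at least two batches and a positive first value")
    rate = (volumes[-1] / volumes[0]) ** (1 / (n - 1)) - 1
    return volumes[-1] * (1 + rate)
```

Either forecast could stand in for the predicted average data volume when the error percentage is computed; the text does not say which variant, if any, was actually used.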
When adjusting the value of N, sample data can be collected with a time window whose initial length is M batches. If M is 3, the 1st collection gathers sample data of 3 consecutive batches. The starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position is the M-th batch before the batch to be detected. If the batch to be detected is the 12th batch, the starting position is the 11th batch and the ending position is the 9th batch, so the 1st collection gathers the 3 consecutive batches 11, 10 and 9. For each subsequent collection the starting position is unchanged (still the 11th batch) and the ending position moves P batches earlier than in the previous collection. If P is 1, the ending position for the 2nd collection moves 1 batch earlier than that of the 1st collection (the 9th batch), to the 8th batch, so the 2nd collection gathers the 4 consecutive batches 11, 10, 9 and 8; the 3rd collection gathers the 5 consecutive batches 11, 10, 9, 8 and 7; and so on until the predetermined number of collections is reached.
For the sample data of the first 3 batches (batches 11, 10 and 9) gathered in the 1st collection, the coefficients of the non-base table for the 3 batches are calculated, and the average coefficient of the non-base table (the mean of the 3 coefficients) is computed. The predicted average data volume of the non-base table for the current batch (the 12th batch) is then: data volume of the non-base table in the previous batch (the 11th batch) + user increment × average coefficient. The error percentage of the non-base table for the current batch (the 12th batch) is then: (predicted average data volume / data volume of the non-base table in the current batch) - 1.
By analogy, the error percentages for the sample data of the first 4 batches (batches 11, 10, 9 and 8) and of the first 5 batches (batches 11, 10, 9, 8 and 7) are calculated. The absolute values of the three error percentages are then compared, and the number of batches in the time window whose error percentage has the smallest absolute value is taken as the value of N. For example, if the absolute value of the error percentage for the first 4 batches is the smallest, then N = 4 is used for subsequent data detection.
It should be noted that:
1. The optimization should not be performed frequently, because frequent optimization increases the risk of "overfitting"; it can be performed once every fixed period (for example, every half year or every year), and when it is performed, the current batch is the batch to be detected.
2. The length of the time window needs a lower limit and an upper limit, for example 3 and 12 respectively, so that at least 3 and at most 12 consecutive batches are collected. This avoids unstable detection caused by too few samples and low detection sensitivity caused by too many samples.
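The window-growing procedure of steps 201 and 202, together with the length limits in note 2, can be sketched as follows. This is an illustrative reading only: batches are assumed to be numbered with consecutive integers, the parameter names (`m`, `p`, `times`, `lower`, `upper`) and the data are invented for the example, and the actual data volume of the batch to be detected is assumed to be already known when the optimization runs.

```python
from typing import Dict, List, Optional


def error_percentage(bg: Dict[int, float], by: Dict[int, float],
                     window: List[int], target: int) -> float:
    """(predicted average volume / actual volume of the target batch) - 1."""
    coeffs = [by[b] / bg[b] for b in window]
    avg = sum(coeffs) / len(coeffs)
    prev = target - 1
    w = bg[target] - bg[prev]                 # user increment
    predicted = by[prev] + w * avg
    return predicted / by[target] - 1


def choose_n(bg: Dict[int, float], by: Dict[int, float], target: int,
             m: int = 3, p: int = 1, times: int = 3,
             lower: int = 3, upper: int = 12) -> int:
    """Return the window length whose error percentage is smallest in absolute value."""
    best_n: Optional[int] = None
    best_err = float("inf")
    for k in range(times):
        length = m + k * p                    # window grows by p each collection
        if length < lower or length > upper:
            continue                          # respect the length limits
        # Window always starts at the batch just before the target.
        window = [target - i for i in range(1, length + 1)]
        if any(b not in bg or b not in by for b in window):
            break                             # not enough history
        err = abs(error_percentage(bg, by, window, target))
        if err < best_err:
            best_n, best_err = length, err
    if best_n is None:
        raise ValueError("no usable window length")
    return best_n


# Invented data for batches 7..12; batch 12 is the batch to be detected.
bg = {7: 1000, 8: 1000, 9: 1000, 10: 1000, 11: 1000, 12: 1100}
by = {7: 3100, 8: 2400, 9: 2000, 10: 2000, 11: 2000, 12: 2210}
n = choose_n(bg, by, target=12)
```

In this invented data set the 4-batch window (batches 11, 10, 9, 8) has an average coefficient of about 2.1, which reproduces the actual volume of batch 12 almost exactly, so `choose_n` selects N = 4, mirroring the outcome described in the text.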
In order to implement the foregoing data volume anomaly detection method, an embodiment of the present application further provides a data volume anomaly detection device, which is applied to a data system, where the data system includes at least one service type and each service type has a base table and at least one non-base table. As shown in fig. 3, the device includes:
the collecting module 10 is configured to collect sample data of N batches for a service type, where a duration of each batch is T, and the sample data of each batch includes a data amount of a base table and data amounts of all non-base tables;
a calculation module 20, configured to compute, for any non-base table, the coefficient of the non-base table for each batch according to the data volume of the non-base table and the data volume of the base table, and to determine, from the coefficients of the N batches for the non-base table, the maximum coefficient and the minimum coefficient of the non-base table;
a prediction module 30, configured to calculate the predicted maximum data volume and the predicted minimum data volume of the non-base table for the (N+1)-th batch according to the maximum coefficient and the minimum coefficient;
a detecting module 40, configured to detect whether the data volume of the non-base table in the (N+1)-th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.
Wherein, in the basic table and the non-basic table, each table contains at least one record, and the data quantity is the record number contained in the table;
the basic table is used for recording basic data of users, and each record corresponds to a unique user identifier; after the user's basic data is generated, the non-basic table is used to record the user's generated associated data.
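Because the data volume is simply the number of records in a table, the per-batch sample data can be obtained by counting records per period. The sketch below assumes T is one month and that every record carries a date; the record layout, the function name and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

Batch = Tuple[int, int]  # (year, month); assumes T = 1 month


def monthly_volumes(record_dates: Iterable[date]) -> Dict[Batch, int]:
    """Data volume per batch = number of records dated within that month."""
    return dict(Counter((d.year, d.month) for d in record_dates))


# Hypothetical records: one date per row of each table.
registrations = [date(2020, 10, 3), date(2020, 10, 17), date(2020, 11, 2)]
orders = [date(2020, 10, 3), date(2020, 10, 4), date(2020, 10, 18),
          date(2020, 11, 2), date(2020, 11, 5)]

base = monthly_volumes(registrations)      # base table (registration)
non_base = monthly_volumes(orders)         # non-base table (medical orders)
```

The two resulting dictionaries hold, for each month, the base-table and non-base-table data volumes from which the coefficients are derived.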
The calculating module 20 is further configured to, when counting the coefficient of the non-base table corresponding to each batch according to the data amount of the non-base table and the data amount of the base table, calculate, for any batch, the coefficient of the non-base table corresponding to the batch as follows: the ratio of the amount of data in the batch for the non-base table to the amount of data in the batch for the base table.
When calculating the predicted maximum data volume and the predicted minimum data volume of the non-base table for the (N+1)-th batch, the prediction module 30 is further configured to: collect the data volume of the base table for the (N+1)-th batch; subtract the data volume of the base table for the N-th batch from the data volume of the base table for the (N+1)-th batch to obtain the user increment; calculate the predicted maximum data volume of the non-base table for the (N+1)-th batch as: data volume of the non-base table in the N-th batch + user increment × maximum coefficient; and calculate the predicted minimum data volume of the non-base table for the (N+1)-th batch as: data volume of the non-base table in the N-th batch + user increment × minimum coefficient.
The detecting module 40 is further configured to, when detecting whether the data volume of the non-base table in the (N+1)-th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume, determine that the data volume is normal if it is greater than or equal to the predicted minimum data volume and less than or equal to the predicted maximum data volume, and otherwise determine that it is abnormal.
If the detection result is wrong, the device further comprises an optimization module 50, configured to adjust the value of N for the batch to be detected;
the collecting module 10 is further configured to collect sample data through a time window, where the initial length of the time window is M batches, the starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position of the time window is the M-th batch before the batch to be detected; each time sample data is collected, the starting position of the time window is unchanged, and the ending position of the time window is moved P batches earlier compared with the previous collection; sample data is collected with the time window a preset number of times;
the optimizing module 50 is further configured to calculate the error percentage corresponding to the sample data of each collection, and to take the number of batches in the sample data whose error percentage has the smallest absolute value as the value of N.
Wherein, when calculating the error percentage corresponding to the sample data collected each time:
the calculating module 20 is further configured to calculate a coefficient corresponding to each batch in the sample data collected at this time for the non-basic table to be detected, and count an average coefficient corresponding to the non-basic table;
the prediction module 30 is further configured to calculate the predicted average data volume of the non-base table to be detected for the batch to be detected as: data volume of the non-base table to be detected in the previous batch + user increment × average coefficient;
the optimization module 50 is further configured to calculate the error percentage of the non-base table to be detected for the sample data of that collection as: (predicted average data volume / data volume of the non-base table in the batch to be detected) - 1.
In addition, an embodiment of the present application further provides an apparatus, including:
one or more processors;
a storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the anomaly detection method for the data volume described above.
Another embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described abnormality detection method for a data amount.
The computer program product may write program code for performing operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be considered as essential to the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, equipment and systems referred to in the present application are only illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment and systems may be connected, arranged and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent aspects of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. An anomaly detection method for medical data volume is characterized in that the method is applied to a data system, the data system comprises at least one service type, each service type comprises a basic table and at least one non-basic table, each table comprises at least one record, and the data volume is the record number contained in the table; the basic table is used for recording basic data of users, and each record corresponds to a unique user identifier; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user; the method comprises the following steps:
for a business type, collecting sample data of N batches, wherein the duration of each batch is T, and the sample data of each batch comprises the data volume of a basic table and the data volumes of all non-basic tables; the N is a positive integer, and the T is larger than zero;
for any one of the non-base tables, according to the data amount of the non-base table and the data amount of the base table, the coefficient of the non-base table corresponding to each batch is counted, and for any one batch, the coefficient of the non-base table corresponding to the batch is: the ratio of the amount of data of the non-base table in the batch to the amount of data of the base table in the batch;
according to the coefficients of N batches corresponding to the non-basic table, calculating the maximum coefficient and the minimum coefficient corresponding to the non-basic table;
calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table corresponding to the (N+1)-th batch according to the maximum coefficient and the minimum coefficient;
detecting whether the data volume of the non-basic table in the (N+1)-th batch is abnormal or not according to the predicted maximum data volume and the predicted minimum data volume;
the calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table corresponding to the (N+1)-th batch comprises the following steps:
collecting the data volume of the basic table of the (N+1)-th batch;
subtracting the data volume of the basic table of the N-th batch from the data volume of the basic table of the (N+1)-th batch to obtain a user increment;
the predicted maximum data volume of the non-basic table in the (N+1)-th batch is: the data volume of the non-basic table in the N-th batch + the user increment × the maximum coefficient;
the predicted minimum data volume of the non-basic table in the (N+1)-th batch is: the data volume of the non-basic table in the N-th batch + the user increment × the minimum coefficient.
2. The method of claim 1, wherein detecting whether the data volume of the non-basic table in the (N+1)-th batch is abnormal based on the predicted maximum data volume and the predicted minimum data volume comprises:
if the data volume of the non-basic table in the (N+1)-th batch is greater than or equal to the predicted minimum data volume and less than or equal to the predicted maximum data volume, determining that the data volume of the non-basic table in the (N+1)-th batch is normal, otherwise determining that the data volume is abnormal.
3. The method of claim 2, wherein the collected N batches of sample data do not include data that has been detected as anomalous.
4. The method of claim 1, further comprising, for a lot to be inspected, adjusting the value of N if the inspection result is incorrect, comprising:
collecting sample data through a time window, wherein the initial length of the time window is M batches, the starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position of the time window is the M-th batch before the batch to be detected; each time sample data is collected, the starting position of the time window is unchanged, and the ending position of the time window is moved P batches earlier compared with the previous collection; the number of times sample data is collected with the time window is a preset number of times;
and calculating the error percentage corresponding to the sample data of each collection, and taking the number of batches in the sample data whose error percentage has the smallest absolute value as the value of N.
5. The method of claim 4, wherein calculating the percentage error for each sample data acquired through the time window comprises:
for sample data acquired at any time, calculating the coefficient of the non-basic table to be detected corresponding to each batch in the sample data acquired at the time, and counting the average coefficient corresponding to the non-basic table;
calculating the predicted average data volume of the non-basic table to be detected corresponding to the batch to be detected as follows: the data volume of the non-basic table to be detected in the previous batch + the user increment × the average coefficient;
the error percentage of the non-basic table to be detected corresponding to the sample data is calculated as follows: (the predicted average data volume / the data volume of the non-basic table in the batch to be detected) - 1.
6. An abnormality detection device for medical data volume, characterized in that the device is applied to a data system, the data system comprises at least one service type, each service type has a basic table and at least one non-basic table, each table contains at least one record, and the data volume is the record number contained in the table; the basic table is used for recording basic data of users, and each record corresponds to a unique user identifier; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user; the device comprises:
the collection module is used for collecting sample data of N batches aiming at a service type, the duration of each batch is T, and the sample data of each batch comprises the data volume of a basic table and the data volumes of all non-basic tables;
the calculation module is used for counting the coefficient of the non-basic table corresponding to each batch according to the data quantity of the non-basic table and the data quantity of the basic table, and the coefficient of the non-basic table corresponding to the batch is as follows for any batch: the ratio of the amount of data of the non-base table in the batch to the amount of data of the base table in the batch; according to the coefficients of N batches corresponding to the non-basic table, calculating the maximum coefficient and the minimum coefficient corresponding to the non-basic table;
the prediction module is used for calculating, according to the maximum coefficient and the minimum coefficient, the predicted maximum data volume and the predicted minimum data volume of the non-basic table corresponding to the (N+1)-th batch, including: collecting the data volume of the basic table of the (N+1)-th batch; subtracting the data volume of the basic table of the N-th batch from the data volume of the basic table of the (N+1)-th batch to obtain a user increment; the predicted maximum data volume of the non-basic table in the (N+1)-th batch is: the data volume of the non-basic table in the N-th batch + the user increment × the maximum coefficient; the predicted minimum data volume of the non-basic table in the (N+1)-th batch is: the data volume of the non-basic table in the N-th batch + the user increment × the minimum coefficient;
and the detection module is used for detecting whether the data volume of the non-basic table in the (n+1) th batch is abnormal or not according to the predicted maximum data volume and the predicted minimum data volume.
7. An apparatus, the apparatus comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-5.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.
CN202011478233.4A 2020-12-15 2020-12-15 Data volume anomaly detection method, device, storage medium and equipment Active CN112635031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011478233.4A CN112635031B (en) 2020-12-15 2020-12-15 Data volume anomaly detection method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN112635031A CN112635031A (en) 2021-04-09
CN112635031B true CN112635031B (en) 2023-08-29

Family

ID=75313495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011478233.4A Active CN112635031B (en) 2020-12-15 2020-12-15 Data volume anomaly detection method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112635031B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016033973A1 (en) * 2014-09-05 2016-03-10 中兴通讯股份有限公司 Method and system for predicting resource occupancy
CN106815255A (en) * 2015-11-27 2017-06-09 阿里巴巴集团控股有限公司 The method and device of detection data access exception
CN110517774A (en) * 2019-08-06 2019-11-29 国云科技股份有限公司 A method of prediction abnormal body temperature
CN110839032A (en) * 2019-11-18 2020-02-25 河南牧业经济学院 Internet of things abnormal data identification method and system
CN111612651A (en) * 2020-05-27 2020-09-01 福州大学 Abnormal electric quantity data detection method based on long-term and short-term memory network
CN111694815A (en) * 2020-06-15 2020-09-22 深圳前海微众银行股份有限公司 Database anomaly detection method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2787438A4 (en) * 2011-08-26 2016-06-29 Hitachi Ltd Predictive sequential calculation device
US11607153B2 (en) * 2017-10-30 2023-03-21 Maxell, Ltd. Abnormal data processing system and abnormal data processing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
医疗保险数据异常行为检测算法和***;楼磊磊;中国优秀硕士学位论文全文数据库 信息科技辑(第02期);第I138-1097页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant