CN115203167A - Data detection method and device, computer equipment and storage medium

Data detection method and device, computer equipment and storage medium

Info

Publication number
CN115203167A
CN115203167A (application CN202210718289.5A)
Authority
CN
China
Prior art keywords
data
target
field
current
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210718289.5A
Other languages
Chinese (zh)
Inventor
傅婕
韦玉凤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd filed Critical Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202210718289.5A
Publication of CN115203167A
Legal status: Pending

Classifications

    • G06F16/212: Information retrieval of structured data; schema design and management with details for data modelling support
    • G06F16/2282: Indexing and data structures therefor; tablespace storage structures and management thereof
    • G06F16/2458: Query processing; special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06N3/006: Computing arrangements based on biological models; artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a data detection method and apparatus, a computer device and a storage medium, wherein the method comprises the following steps: traversing a data warehouse and determining the target field currently to be detected, the target field being a numerical field in the current data table traversed in the data warehouse; acquiring test data corresponding to the target field; performing isolated forest modeling with the target data to be detected for the target field in the current data table and the test data as input data, to obtain an isolated forest model corresponding to the target field; calculating the abnormal score of each item of target data according to the isolated forest model; and determining the target data whose abnormal score is smaller than a first threshold to be the abnormal data of the target field in the current data table. By performing anomaly detection with the isolated forest algorithm on the data corresponding to each target field of each data table in the data warehouse, the method and apparatus improve the efficiency and accuracy of data detection.

Description

Data detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data detection method and apparatus, a computer device, and a storage medium.
Background
The quality of the data warehouse directly determines the quality of the application layer, but the development of the data warehouse and the circulation of data through it can very easily introduce data anomalies. In the prior art, most testers rely on experience and write large amounts of SQL check code to verify the correctness of the data in the data warehouse. However, a data warehouse may contain hundreds of tables, each table may hold hundreds of millions of records, and the data flows through multiple layers; checking such huge volumes of data requires writing a great deal of SQL check code. The workload is enormous: it not only consumes manpower but also cannot guarantee the accuracy of the SQL code itself, so the data detection results lack reliability.
Disclosure of Invention
The present application aims to solve the technical problem in the prior art that verifying the data in a data warehouse through manual experience and hand-written code cannot guarantee the accuracy and reliability of the detection results. The application provides a data detection method and apparatus, a computer device and a storage medium, whose main purpose is to perform anomaly detection on the data corresponding to each target field of each data table in the data warehouse using the isolated forest algorithm, thereby improving the efficiency and accuracy of data detection.
In order to achieve the above object, the present application provides a data detection method, including:
traversing the data warehouse, and determining a current target field to be detected, wherein the target field is a numerical field in a current data table traversed in the data warehouse;
acquiring test data corresponding to a target field;
performing isolated forest modeling by taking target data to be detected and test data corresponding to a target field in a current data table as input data to obtain an isolated forest model corresponding to the target field;
calculating the abnormal score of each target data according to the isolated forest model;
and judging the target data with the abnormality score smaller than a first threshold value as the abnormal data of the target field in the current data table.
In addition, in order to achieve the above object, the present application also provides a data detection apparatus, including:
the detection field determining module is used for traversing the data warehouse and determining a current target field to be detected, wherein the target field is a numerical field in a current data table traversed in the data warehouse;
the test data screening module is used for acquiring test data corresponding to the target field;
the model building module is used for performing isolated forest modeling by taking target data to be detected and test data corresponding to a target field in a current data table as input data to obtain an isolated forest model corresponding to the target field;
the calculation module is used for calculating the abnormal score of each target data according to the isolated forest model;
and the first judging module is used for judging the target data with the abnormal score smaller than a first threshold value as the abnormal data of the target field in the current data table.
To achieve the above object, the present application further provides a computer device, which includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, performs the steps of the data detection method described above.
To achieve the above object, the present application further provides a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, cause the processor to perform the steps of the data detection method described above.
According to the data detection method and apparatus, the computer device and the storage medium, anomaly detection is performed with the isolated forest algorithm on the data corresponding to each target field of each data table in the data warehouse, so that abnormal data differing markedly from the other data in the warehouse is identified and distinguished, and the efficiency of data detection is improved. The method and apparatus reduce the writing of check code, effectively cutting the workload of data detection and the manpower consumed. In addition, compared with data detection based on SQL code written from experience, automatic data anomaly detection via the isolated forest algorithm involves less human intervention and is therefore more objective, and the detection results obtained are more accurate and reliable.
Drawings
FIG. 1 is a schematic flow chart illustrating a data detection method according to an embodiment of the present application;
FIG. 2 is a schematic diagram showing the distribution of isolated points in a data line graph;
FIG. 3 is a block diagram of a data detection apparatus according to an embodiment of the present application;
fig. 4 is a block diagram of an internal structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application.
Fig. 1 is a schematic flow chart of a data detection method according to an embodiment of the present application. Referring to fig. 1, the data detection method includes the following steps S100 to S500.
S100: and traversing the data warehouse and determining a current target field to be detected, wherein the target field is a numerical field in a current data table traversed in the data warehouse.
Specifically, the data warehouse comprises a plurality of layers, namely an ODS layer, a DW layer, a DIM layer and an APP layer. Each layer includes a plurality of data tables, and each data table includes at least one field and the data corresponding to that field.
The ODS layer directly receives the raw data streamed in from each business system. The DW layer comprises the DWS, DWM and DWD layers: the DWS layer stores data-mart (wide-table) data, namely highly aggregated data; the DWM layer stores lightly aggregated data, namely data retaining most of its dimensions; the DWD layer retains the original granularity of the data, which is cleaned and processed on the basis of the ODS layer so as to provide cleaner data. The DIM layer, also called the DM layer, includes high-cardinality and low-cardinality dimension tables. The APP layer includes business-specific data serving particular scenarios.
Data in the data warehouse is generally updated over time; in particular, financial business scenarios applied to BI decisions are often supported by massive multidimensional data warehouses. Such a financial data warehouse is an integrated, time-varying data collection oriented to financial subjects, although the information itself is relatively stable. BI (Business Intelligence) is a technology that applies data warehousing, online analysis, data mining and similar techniques to process and analyze data, aiming to provide decision support for enterprise decision makers.
The data quality of the data warehouse directly determines the quality of the topmost (application) layer: accurate and complete data yields correct reports and reasonable data analysis, enabling business personnel to make correct decisions; otherwise it is garbage in, garbage out. However, the development process of a data warehouse can very easily introduce data anomalies. Therefore, detecting the data stored in the data warehouse is necessary.
Following the traversal rule, the data warehouse is traversed layer by layer according to the order in which data circulates through it, and table by table within the same layer; within the same data table the numerical fields are selected, and the data corresponding to each numerical field is detected in sequence.
The current data table is the data table of the current layer being traversed in the data warehouse. The current data table may include a plurality of numerical fields, and the target field is the field currently selected for detection according to the traversal rule.
The method can look up which fields of the current data table are numerical fields in a predefined table of fields to be detected, and take them in sequence as candidate fields for data detection. Alternatively, the fields of the current data table can be traversed one by one according to the traversal rule: upon reaching a field, the table of fields to be detected is consulted to judge whether the field needs detection; if so, the data corresponding to the field is detected, and if not, the field is skipped and traversal continues with the other fields, until one data table is exhausted and the next data table is traversed. A numerical field is a field whose field values are of a numerical type.
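To make the traversal concrete, the following minimal Python sketch walks the warehouse layer by layer in flow order and yields the numerical fields to be detected. The `warehouse` metadata helper, its methods, and the set of type names are illustrative assumptions, not part of this application; a real deployment would query the warehouse catalog or the field table to be detected.

```python
# Sketch of the S100 traversal under the assumptions stated above.
NUMERIC_TYPES = {"int", "bigint", "float", "double", "decimal"}

def iter_target_fields(warehouse):
    """Yield (layer, table, field) for every numerical field, in flow order."""
    for layer in ("ODS", "DWD", "DWM", "DWS", "DIM", "APP"):  # circulation order
        for table in warehouse.tables(layer):                 # table by table
            for field, dtype in warehouse.schema(layer, table).items():
                if dtype.lower() in NUMERIC_TYPES:            # numerical fields only
                    yield layer, table, field
```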
S200: and acquiring test data corresponding to the target field.
Specifically, the test data is known-correct data for the same field, and may be historically correct data for that field in the data warehouse. Because the data in the warehouse grows at intervals, a large amount of historical data is stored there, from which the test data corresponding to each field to be detected can be screened.
S300: and performing isolated forest modeling by taking the target data to be detected and the test data corresponding to the target field in the current data table as input data to obtain an isolated forest model corresponding to the target field.
Specifically, source data flows from upstream systems into the data warehouse, data may flow between the layers of the warehouse, and some data undergoes different processing at different layers. Consequently, even data of the same field may take different values, with large differences, at different layers. Therefore, in this embodiment, in order to screen out abnormal data and accurately detect the correctness of the data, an isolated forest model is constructed for each target field of each data table. The isolated forest model is constructed from the target data and the test data of the same field and is used to detect whether isolated, abnormal data exists among the target data of the target field.
The isolated forest modeling is specifically as follows: an isolated forest model variable is created using the scikit-learn library, and four parameters (the number of estimators n_estimators, the sub-sample size max_samples, the expected contamination contamination, and the feature fraction max_features) are passed in to instantiate the IsolationForest class. These four parameters directly affect the effectiveness of the isolated forest. For example, model = IsolationForest(n_estimators=50, contamination='auto', max_samples='auto', max_features=1.0) is one example of a parameterization. Of course, the four parameters may also be left at their defaults, depending on the actual application.
n_estimators defaults to 100 and determines how many iTrees are built to form the forest, i.e., the construction of the binary-tree forest iForest mentioned below.
max_samples defaults to 256 and configures the sample size, referred to below in "to construct a binary tree iTree, first randomly select X samples from the training data".
contamination is the estimated proportion of abnormal values in the data.
max_features defaults to 1.0, i.e., all features; for high-dimensional data only some of the features may be selected. Binary partitioning of the X samples is based on randomly selecting a feature from a selectable feature set, and this parameter determines the proportion of features in that set. Selecting only a portion can improve computational efficiency.
The isolated forest model is then constructed by calling the model's fit function with the target data and the test data as input data.
The isolated forest algorithm is an unsupervised learning algorithm and needs no prior knowledge: binary search trees are constructed over multiple iterations and then combined into the isolated forest model. By default the height of each binary search tree is 8, and 100 trees form a forest, each tree being grown from a sub-sample of at most 256 points.
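As a minimal runnable sketch of this modeling step, assuming scikit-learn as described above (the sample values are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# test data: known-correct history of the field; target data: data to detect
test_data = np.array([[0.0009], [0.0010], [0.0013], [0.0012], [0.0011]])
target_data = np.array([[0.0010], [0.0012], [0.08], [0.0011]])

model = IsolationForest(
    n_estimators=50,       # number of iTrees in the forest (default 100)
    max_samples="auto",    # sub-sample size per tree: min(256, n_samples)
    contamination="auto",  # expected proportion of outliers
    max_features=1.0,      # fraction of features considered per split
)
model.fit(np.vstack([test_data, target_data]))  # build the isolated forest
```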
In addition, the target data to be detected is the data in the current data table that has not yet undergone detection. For example, new data may flow into the data warehouse at intervals, and the target data to be detected is then the data newly flowed into the current data table.
For data in specialized fields, testers in the prior art need the corresponding domain knowledge (for example, financial data requires financial knowledge) before they can correctly write SQL test statements for data detection, so the required expertise is high and the range of application narrow. Building the isolated forest model in this embodiment requires no specialized knowledge for recognizing domain-specific data, and isolated data can be judged objectively against correct historical data. The range of application is therefore much wider.
S400: and calculating the abnormal score of each target data according to the isolated forest model.
Specifically, one target field corresponds to at least one item of target data, and each item of target data or test data is one sample point. After the binary-tree forest iForest, i.e., the isolated forest model, has been constructed, each sample point is predicted; prediction performs a recursive in-order traversal of each binary search tree and records the path length h(x) from the root node to the leaf node. After all sample path lengths h(x) in the forest have been computed, the expectation E(h(x)) and variance S(h(x)) over all data sample points are computed by statistical methods, and the abnormal score of a data point is then obtained from its deviation from the expectation and variance. Specifically, the decision_function is used to return the abnormal score of the data.
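Continuing the sketch above, the scoring and thresholding of steps S400 and S500 can be expressed as follows; the threshold of 0 matches the worked example below:

```python
FIRST_THRESHOLD = 0.0

# decision_function returns one score per sample; lower means more anomalous
scores = model.decision_function(target_data)
for value, score in zip(target_data.ravel(), scores):
    if score < FIRST_THRESHOLD:  # below the first threshold: abnormal data
        print(f"abnormal: value={value}, score={score:.6f}")
```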
S500: and judging the target data with the abnormality score smaller than the first threshold value as the abnormal data of the target field in the current data table.
Specifically, the lower the abnormal score, the more likely the data is to be abnormal, i.e., an outlier in the isolated forest model.
The data detection method is particularly suitable for detecting financial data. Financial data is a kind of time-series data with strong timeliness and strong dependency between successive values, and is generally two-dimensional continuous structured data. Taking the per-unit change in rate of return as an example, the data source feeds T-1 financial data into the data warehouse; measured against several years of historical detection data, the abnormal data among the T-1 financial data are points far from the high-density cluster, i.e., isolated outliers relative to the normal data, which can be understood as sparsely distributed. Specifically, as shown in the line-graph distribution of isolated points in fig. 2, the point corresponding to the value 20.0 is an isolated point.
As shown in table 1 below, the table includes fields for date, daily rate of return, cumulative rate of return, annualized rate of return, unit price, capacity and fair value. The data corresponding to daily rate of return, cumulative rate of return, annualized rate of return, unit price, capacity and fair value are numerical and serve as candidate fields to be detected; the data corresponding to date is non-numerical, so that field is not a candidate and requires no detection.
TABLE 1
(Table 1 appears in the original publication as an image and is not reproducible here.)
If the target field currently to be detected is the daily rate of return, the abnormal score of each item of target data corresponding to the daily rate of return is calculated with the corresponding isolated forest model. The scoring results are shown in table 2:
TABLE 2
(Table 2 appears in the original publication as an image and is not reproducible here.)
As can be seen from tables 1 and 2, the daily rate of return of 0.08 is far from the daily rates of return of the other dates, and its abnormal score of -0.255502 is far smaller than the abnormal scores of the other dates. If the first threshold is set to 0, the abnormal score of -0.255502 is smaller than the first threshold; therefore, the daily rate of return of 0.08 on 2015-2-1 is abnormal data.
After the daily rate of return has been detected, traversal automatically moves on to the cumulative rate of return. With the cumulative rate of return as the target field, its corresponding test data is acquired, a corresponding isolated forest model is constructed from the target data and test data of that field, and the model is used to detect anomalies in the target data, yielding the abnormal data of the cumulative rate of return. Proceeding by analogy, the data of all target fields in the data warehouse can be detected by traversal.
In this embodiment, anomaly detection is performed with the isolated forest algorithm on the data corresponding to each target field of each data table in the data warehouse, so as to identify and distinguish abnormal data that differs markedly from the other data in the warehouse, improving the efficiency of data detection. The embodiment reduces the writing of check code, effectively cutting the workload of data detection and the manpower consumed. In addition, compared with data detection based on SQL code written from experience, automatic anomaly detection via the isolated forest algorithm involves less human intervention and is therefore more objective; the detection results obtained are more accurate and reliable, no extensive domain expertise is needed, and the range of application is wider.
In one embodiment, step S200 specifically includes:
and extracting historical data of the target field in a first preset time period in a current data table where the target field is located as test data, wherein the target data and the test data are data in different time periods.
Specifically, in this embodiment the test data of the target field is historical data of the same field extracted, on a rolling basis, from the same data table in the same layer; that is, the test data is the historical data of the target field in the same data table over a historical time period. The first preset time period may be the most recent one, two or three years counted back from the current time, and it is a different time period from that of the target data. Because more recent historical data is closer to the current target data to be detected, extracting the recent history of the same field of the same table within the first preset time period as test data not only makes the sample data more similar to the current data and a better reference, but also, given the large data scale of warehouse tables, effectively reduces the amount of computation by extracting only part of the history, avoiding invalid calculation and lowering computational cost. A minimal extraction sketch follows the next paragraph.
In addition, the test data may also be the residual historical data obtained by removing historically abnormal data from the history of the same field extracted from the data table of the same layer. Removing invalid abnormal data reduces both the interference with data detection and the amount of computation, improving the accuracy of abnormal-data detection.
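A minimal pandas sketch of this rolling extraction, assuming the table carries a "date" column and using a 365-day window to stand in for the first preset time period:

```python
import pandas as pd

def recent_history(table: pd.DataFrame, field: str, days: int = 365) -> pd.Series:
    """Return the target field's history within the preset window as test data."""
    dates = pd.to_datetime(table["date"])
    cutoff = dates.max() - pd.Timedelta(days=days)
    return table.loc[dates >= cutoff, field]  # rolling window of recent history
```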
In one embodiment, step S200 specifically includes:
determining the current layer where the current data table is located and the previous layer of the current layer;
if the target data to be detected corresponding to the target field was obtained by a full import of the data of the same field from the previous layer of the current layer, extracting the historical full data of that field in the previous layer within a second preset time period as test data, wherein the target data and the test data are data of different time periods;
if the target data to be detected corresponding to the target field was not obtained by a full import of the data of the same field from the previous layer of the current layer, extracting historical data of the target field within a first preset time period from the current data table where the target field is located as test data, wherein the target data and the test data are data of different time periods.
Specifically, the current layer is one of the bottom-up layers of the data warehouse, such as the ODS, DW or DIM layer. Data flows from bottom to top in the warehouse, i.e., from the ODS layer to the DW layer, and from the DW layer to the DIM layer; the previous layer of the current layer is determined by this order in which data circulates across the layers. For example, if the current layer is the DW layer, its previous layer is the ODS layer (e.g., table a1 in the ODS layer is mapped in full to table b2 in the DW layer); if the current layer is the DIM layer, its previous layer is the DW layer (e.g., table b1 in the DW layer is mapped in full to table c1 in the DIM layer).
If the current layer where the target field is located is the DW layer and the target data of the target field was obtained by a full import from the ODS layer, the historical full data of the same target field within the second preset time period is extracted from the ODS layer as test data. For example, the historical full data of the target field within the most recent second preset time period is extracted from table a1 of the ODS layer as the test data for the same target field in table b2 of the DW layer.
If the current layer where the target field is located is the DIM layer and the target data of the target field was obtained by a full import from the DW layer, the historical full data of the same target field within the second preset time period is extracted from the DW layer as test data. For example, the historical full data of the target field within the most recent second preset time period is extracted from table b1 of the DW layer as the test data for the same target field in table c1 of the DIM layer.
The second preset time period may be the most recent one, two or three years counted from the current time, and it is a different time period from that of the target data.
Full data, i.e., full-import or full-mapping data, is data that is mapped or imported directly to the next layer without any computation or processing of the previous layer's data. For fully imported data, the data of the target field at the current layer should in theory be identical to the data of the target field at the previous layer; any difference indicates an error in the full-import process. Therefore, this embodiment uses the historical full data of the same field in the previous layer as test data, which to a certain extent can detect whether data transfer between layers is abnormal and expose problems in the data-transfer process.
In addition, because historical full data closer to the current moment is closer to the current target data to be detected, extracting the previous layer's historical full data of the same target field within the second preset time period as test data makes the sample data more similar to the current data and a better reference; and given the large data scale of warehouse tables, extracting only part of the historical full data as test data effectively reduces the amount of computation, avoids invalid calculation and lowers computational cost.
If the current layer is the ODS layer, it has no previous layer in the data warehouse, so the test data of the target field is the historical data of the same target field extracted on a rolling basis from the same data table of the same layer; that is, the test data is the historical data of the target field in the same data table over a historical time period. The first preset time period may be the most recent one, two or three years from the current time, and it is a different time period from that of the target data. Because more recent historical data is closer to the current target data to be detected, taking the recent history as test data makes the sample data more similar to the current data and a better reference; and given the large scale of warehouse tables, taking only the recent history as test data effectively reduces the amount of computation, avoids invalid calculation and lowers computational overhead.
In addition, the test data may also be the residual historical data obtained by removing historically abnormal data from the history of the same field extracted from the data table of the same layer. Removing invalid abnormal data reduces both the interference with data detection and the amount of computation, improving the accuracy of abnormal-data detection. The selection between the two sources of test data is sketched below.
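In the sketch below, `load_history` is a hypothetical helper returning a field's history from a given layer's table within a time window, standing in for the first and second preset time periods above:

```python
def pick_test_data(layer, prev_layer, table, field, is_full_import, load_history):
    if prev_layer is not None and is_full_import:
        # full import: the previous layer's history should match exactly
        return load_history(prev_layer, table, field, days=365)
    # otherwise (including the ODS layer): the same table's own rolling history
    return load_history(layer, table, field, days=365)
```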
In one embodiment, the method further comprises:
acquiring the ratio of the number of items of abnormal data to the number of items of target data;
if the ratio exceeds a second threshold, judging whether the abnormal data was caused by a change in the numerical value entry rule;
and if the abnormal data is determined to have been caused by a change in the numerical value entry rule, re-determining the abnormal data as normal data, and modifying the test data according to the new numerical value entry rule so that the modified test data is applied in subsequent data detection.
Specifically, if the ratio of the number of items of abnormal data to the number of items of target data to be detected for the target field exceeds the second threshold, the amount of abnormal data exceeds what is normally tolerable, which may indicate a major error in the data flow of the data warehouse. One of the more likely causes is that the numerical value entry rule has changed.
For example, the data of a rate-of-return field originally entered as percentages may be changed to entry of the raw values. More concretely, previous rates of return were entered as the values 1.00%, 1.20% and 1.12%; owing to business demand, they are now entered as the raw values 1.00, 1.20 and 1.12. Compared with the original test data of 1.00%, 1.20%, 1.12% and so on, the current target data of 1.00, 1.20 and 1.12 appear abnormal, yet in fact data entered under the new rule is not abnormal.
Based on the above situation, whether the anomalies were caused by a change in the numerical value entry rule can be determined from the ratio between the current abnormal data and the test data. For example, the ratio between 1.00 and 1.00% is 100, far exceeding the ratio between different target data of the same field, or far exceeding the average of those ratios. Conversely, the ratio between 1.00% and 1.00 is 0.01, far smaller than the ratio between different target data of the same field or the average of those ratios. It can therefore be determined that the numerical value entry rule has changed. The change may be made deliberately according to business requirements, or may be caused by negligence such as a misplaced decimal point during entry.
In addition, if the ratio exceeds the second threshold, the abnormal data may be provided to a tester to further judge whether it was caused by a change in the numerical value entry rule; if confirmation is received from the tester that the abnormal data was indeed caused by such a change, the abnormal data is re-determined as normal data.
Furthermore, since data subsequently flowing into the same field of the data warehouse after the rule change is generally entered according to the new numerical value entry rule, and in order to prevent further misjudgments of abnormal data caused by the change, this embodiment modifies the test data according to the new rule so that the test data is expressed under it. The modified test data can then be applied in subsequent data detection, normal data arriving under the new rule will no longer be misjudged as abnormal, and the effect is timely automatic correction and fewer misjudgments. A sketch of this heuristic follows.
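In the sketch below, the second threshold and the factor bounds (matching the 100x and 0.01x of the percent-to-raw example above) are illustrative assumptions:

```python
import numpy as np

def entry_rule_changed(abnormal_values, target_count, test_values,
                       second_threshold=0.5):
    """Guess whether flagged data reflects a changed numerical value entry rule."""
    if len(abnormal_values) / max(target_count, 1) <= second_threshold:
        return False  # anomaly count within the normally tolerable amount
    factor = np.median(abnormal_values) / np.median(test_values)
    # e.g. 1.00 vs 1.00% gives a factor of 100; the inverse change gives 0.01
    return factor >= 50 or factor <= 0.02
```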
In one embodiment, the method further comprises:
and updating the related information of the abnormal data into a test report, wherein the test report includes the abnormal data and its corresponding abnormal score, the abnormal field, and the table information of the data table containing the anomaly.
Specifically, after the target data of a target field in a data table has been traversed and detected, the resulting abnormal data, the corresponding abnormal field and the table information of the data table containing the anomaly are extracted and written into the test report.
After all the target fields in the data warehouse have been traversed, the completed test report is pushed to the relevant personnel, so that they can quickly learn from the report how reliable and accurate the data flowing into the warehouse is, correct abnormal data as soon as possible, and promptly take precautions against its recurrence.
In one embodiment, the method further comprises:
and sorting the related information corresponding to different abnormal data of the same target field in the same data table by abnormal score, and displaying it in that order in the test report.
Specifically, the sorting may be ascending or descending by abnormal score: the related information of abnormal data with a lower score may be displayed earlier, or that with a higher score earlier. Since the abnormal score is a "probability-like" value, displaying the related information of the abnormal data of the same field of the same data table sorted by score makes it easier for testers to analyze the anomalies.
In one embodiment, the determining the current target field to be detected in step S100 includes:
screening out the fields of the current data table whose corresponding data is of a numerical type, using the data dictionary of the data warehouse, as candidate fields;
and determining the target field currently to be detected from the candidate fields according to the traversal rule.
Specifically, the numerical fields in all the data tables can be extracted for checking according to the data dictionary of the data warehouse; that is, the data dictionary is used to look up the numerical fields to serve as target fields, while fields whose data is of types such as date or character string, to which detection does not apply, can be excluded.
A data dictionary is used to define and describe the data items, data structures, data flows, data storage, processing logic and so on of the data; it is a collection of information describing the data, namely the set of definitions of all data elements used in the data warehouse. Accordingly, the target fields of numerical type can be filtered out using the data dictionary.
This embodiment makes deft use of the data dictionary to locate the numerical fields of a data table quickly, reducing search and screening time and increasing the speed and efficiency of data detection. A minimal screening sketch follows.
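In the sketch below, the row layout of the data dictionary (table_name, field_name, data_type) is an assumed convention for illustration:

```python
NUMERIC_TYPES = {"int", "bigint", "float", "double", "decimal"}

def candidate_fields(data_dictionary, table_name):
    """Return the numerical fields of one table; dates and strings are excluded."""
    return [row["field_name"]
            for row in data_dictionary
            if row["table_name"] == table_name
            and row["data_type"].lower() in NUMERIC_TYPES]
```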
In one embodiment, step S500 specifically includes:
providing the target data whose abnormal score is smaller than the first threshold to a tester for further judgment;
and if confirmation of the anomaly is received from the tester, determining the target data whose abnormal score is smaller than the first threshold to be abnormal data.
Specifically, in order to reduce the possibility of misjudgment, the target data whose abnormal score is smaller than the first threshold may be provided to a tester to reconfirm whether it is abnormal data. Introducing such a manual review mechanism reduces misjudgments, and the cause of a misjudgment can be found at the review stage so that the corresponding data is corrected and remedied in time.
If a denial of the anomaly is received from the tester, the abnormal data is re-determined as normal data. Having confirmed a misjudgment, the tester can also respond with corresponding measures to discover its cause and correct it.
In another embodiment, when abnormal data is determined to occur, an alarm channel can be triggered, so that the cause of the abnormal data is found promptly through manual intervention and the data is corrected in time.
In one embodiment, traversing the data warehouse in step S100 includes:
and, if a target message indicating that data extraction and processing have completed is monitored in the message middleware, starting the current round of data detection and traversing the data warehouse, wherein the target message is pushed to the message middleware by a linkdo task after the data extraction and processing complete.
Specifically, a data scheduling task is started at preset time intervals to extract source data from the upstream systems; after the source data has been processed, it circulates among the layers of the data warehouse, so that the data of every layer is updated and grows. When the data extraction and processing complete, the linkdo task produces a target message and pushes it to the message middleware to await consumption by the data detection apparatus. The target message indicates that the data extraction and processing have completed, causing the apparatus to begin the current round of data detection.
Data detection is started once each time the data in the data warehouse is updated. Each round detects the newly added data in the warehouse; the remaining historical data was already detected in the previous round, so it need not be detected repeatedly.
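A hedged sketch of this trigger, assuming Kafka as the message middleware; the topic name, the message body and run_detection_round() are placeholders (the application itself names only a message middleware and a linkdo task):

```python
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer("warehouse-etl", bootstrap_servers="localhost:9092")
for message in consumer:
    if message.value == b"extraction_complete":  # the target message (assumed format)
        run_detection_round()  # traverse the warehouse and run S100 to S500
```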
Fig. 3 is a block diagram of a data detection apparatus according to an embodiment of the present application. Referring to fig. 3, the apparatus includes:
the detection field determining module 100 is configured to traverse the data warehouse and determine a target field to be currently detected, where the target field is a numeric field in a current data table traversed in the data warehouse;
the test data screening module 200 is configured to obtain test data corresponding to a target field;
the model building module 300 is used for performing isolated forest modeling by taking target data to be detected and test data corresponding to a target field in a current data table as input data to obtain an isolated forest model corresponding to the target field;
a calculation module 400, configured to calculate an anomaly score of each target data according to the isolated forest model;
the first determining module 500 is configured to determine that the target data with the abnormality score smaller than the first threshold is abnormal data of the target field in the current data table.
In an embodiment, the test data screening module 200 is specifically configured to extract, as the test data, historical data of the target field in a first preset time period in a current data table where the target field is located, where the target data and the test data are data in different time periods.
In an embodiment, the test data screening module 200 specifically includes:
the hierarchy determining module is used for determining the current hierarchy of the current data table and the previous hierarchy of the current hierarchy;
the first data extraction module is used for extracting historical full data of the same field in a second preset time period of a previous layer as test data if the target data to be detected corresponding to the target field is obtained by importing the full data of the same field in the previous layer of the current layer, wherein the target data and the test data are data in different time periods;
and the second data extraction module is used for extracting historical data of the target field in a first preset time period in a current data table where the target field is located as test data if the target data to be detected corresponding to the target field is not obtained by importing the total data of the same field in the last layer of the current layer, wherein the target data and the test data are data in different time periods.
In one embodiment, the apparatus further comprises:
the anomaly ratio calculation module is used for acquiring the ratio of the number of the anomaly data to the number of the target data;
the second judgment module is used for judging whether the abnormal data is caused by the change of the numerical value entry rule or not if the ratio exceeds a second threshold value;
the first correction module is used for re-determining the abnormal data as the normal data if the abnormal data is determined to be caused by the change of the numerical value entry rule;
and the second correction module is used for modifying the test data according to the new numerical value entry rule so as to apply the modified test data to subsequent data detection.
In one embodiment, the apparatus further comprises:
and the test report generation module is used for updating the related information of the abnormal data to the test report, wherein the test report comprises the abnormal data, the corresponding abnormal score, the abnormal field and the table information of the abnormal data table.
In one embodiment, the detection field determination module 100 includes:
the candidate field screening module is used for screening out fields with numerical value types corresponding to data in the current data table as candidate fields by using a data dictionary in the data warehouse;
and the target field determining module is used for determining the current target field to be detected from the candidate fields according to the traversal rule.
In one embodiment, the detection field determining module 100 further comprises:
and the traversing module is used for starting data detection of the current round and traversing the data warehouse if a target message indicating that the data extraction processing is finished is monitored in the message middleware, wherein the target message is pushed to the message middleware through a linkdo task after the data extraction processing is finished.
This application uses the isolated forest algorithm to detect abnormal data in a data warehouse; the algorithm is fast, consumes little memory, and is easy to put into practice. Compared with supervised algorithms, which require prior knowledge and large amounts of labeled data prepared in advance as training data, the isolated forest algorithm saves the cost of labeling data; testers need not spend large amounts of time writing SQL test code, the demand for expertise in the specific data domain is low, the range of application is wide, abnormal data can be detected objectively, and the scope of testing is greatly reduced.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.
For specific limitations of the data detection device, see the above limitations for the data detection method, which are not described herein again. The modules in the data detection device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 4 is a block diagram of an internal structure of a computer device according to an embodiment of the present application. As shown in fig. 4, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory includes a storage medium and an internal memory. The storage medium may be a nonvolatile storage medium or a volatile storage medium. The storage medium stores an operating system and may also store computer readable instructions that, when executed by the processor, may cause the processor to implement a data detection method. The internal memory provides an environment for the operating system and the execution of computer-readable instructions in the storage medium. The internal memory may also have computer readable instructions stored thereon that, when executed by the processor, cause the processor to perform a method of data detection. The network interface of the computer device is used for communicating with an external server through a network connection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In one embodiment, a computer device is provided, which includes a memory, a processor, and computer readable instructions (e.g., a computer program) stored on the memory and executable on the processor, wherein the processor executes the computer readable instructions to implement the steps of the data detection method in the above embodiments, such as the steps S100 to S500 shown in fig. 1 and other extensions of the method and related steps. Alternatively, the processor executes the computer readable instructions to implement the functions of the modules/units of the data detection apparatus in the above embodiments, such as the functions of the modules 100 to 500 shown in fig. 3. To avoid repetition, further description is omitted here.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the computer device, connecting the various parts of the whole device through various interfaces and lines.
The memory may be used to store computer readable instructions and/or modules, and the processor may implement various functions of the computer apparatus by executing or executing the computer readable instructions and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the cellular phone, etc.
The memory may be integrated in the processor or may be provided separately from the processor.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided, on which computer readable instructions are stored, which when executed by a processor implement the steps of the data detection method in the above embodiments, such as the steps S100 to S500 shown in fig. 1 and extensions of other extensions and related steps of the method. Alternatively, the computer readable instructions, when executed by the processor, implement the functions of the modules/units of the data detection apparatus in the above embodiments, such as the functions of the modules 100 to 500 shown in fig. 3. To avoid repetition, further description is omitted here.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the embodiments described above may be implemented by computer readable instructions directing the associated hardware; the instructions may be stored in a computer readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, apparatus, article, or method that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present application may be substantially or partially embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application; any equivalent structure or equivalent process transformation made using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present application.

Claims (10)

1. A method of data detection, the method comprising:
traversing a data warehouse, and determining a current target field to be detected, wherein the target field is a numerical field in a current data table traversed in the data warehouse;
acquiring test data corresponding to the target field;
performing isolation forest modeling by taking the target data to be detected corresponding to the target field in the current data table and the test data as input data to obtain an isolation forest model corresponding to the target field;
calculating an anomaly score of each piece of target data according to the isolation forest model;
and judging the target data whose anomaly score is smaller than a first threshold value as abnormal data of the target field in the current data table.
2. The method of claim 1, wherein the obtaining of the test data corresponding to the target field comprises:
and extracting, as the test data, historical data of the target field within a first preset time period from the current data table in which the target field is located, wherein the target data and the test data are data of different time periods.
3. The method of claim 1, wherein the obtaining of the test data corresponding to the target field comprises:
determining a current layer where the current data table is located and a previous layer of the current layer;
if the target data to be detected corresponding to the target field is obtained by a full import of the data of the same field from the previous layer of the current layer, extracting, as the test data, historical full-volume data of the same field of the previous layer within a second preset time period, wherein the target data and the test data are data of different time periods;
if the target data to be detected corresponding to the target field is not obtained by a full import of the data of the same field from the previous layer of the current layer, extracting, as the test data, historical data of the target field within a first preset time period from the current data table in which the target field is located, wherein the target data and the test data are data of different time periods.
4. The method of claim 1, further comprising:
acquiring the ratio of the number of the abnormal data to the number of the target data;
if the ratio exceeds a second threshold value, judging whether the abnormal data is caused by a change of a numerical value entry rule;
and if the abnormal data is determined to be caused by the change of the numerical value entry rule, re-determining the abnormal data as normal data, and modifying the test data according to a new numerical value entry rule so as to apply the modified test data to subsequent data detection.
5. The method of claim 1, further comprising:
and updating related information of the abnormal data into a test report, wherein the test report comprises the abnormal data and the corresponding anomaly scores, abnormal fields, and table information of the abnormal data table.
6. The method of claim 1, wherein determining the current target field to be detected comprises:
screening out, by using a data dictionary in the data warehouse, fields whose corresponding data in the current data table is of a numerical type as candidate fields;
and determining the current target field to be detected from the candidate fields according to a traversal rule.
7. The method of claim 1, wherein traversing the data warehouse comprises:
and if a target message in the message middleware indicating that data extraction processing is finished is monitored, starting a current round of data detection and traversing the data warehouse, wherein the target message is pushed to the message middleware by a linkdo task after the data extraction processing is finished.
8. A data detection apparatus, characterized in that the apparatus comprises:
the detection field determining module is used for traversing the data warehouse and determining a current target field to be detected, wherein the target field is a numerical field in a current data table traversed in the data warehouse;
the test data screening module is used for acquiring the test data corresponding to the target field;
the model building module is used for performing isolation forest modeling by taking the target data to be detected corresponding to the target field in the current data table and the test data as input data to obtain an isolation forest model corresponding to the target field;
the calculation module is used for calculating an anomaly score of each piece of target data according to the isolation forest model;
and the first judging module is used for judging the target data whose anomaly score is smaller than a first threshold value as abnormal data of the target field in the current data table.
9. A computer device comprising a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, performs the steps of the data detection method of any one of claims 1-7.
10. A computer readable storage medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to perform the steps of the data detection method of any one of claims 1-7.
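For orientation, the following is a minimal sketch of the detection flow recited in claims 1 and 2, written in Python and assuming scikit-learn's IsolationForest as the isolation forest implementation; the function name detect_abnormal_data, the sample values, and the threshold value of -0.55 are hypothetical illustrations rather than values taken from the specification.

import numpy as np
from sklearn.ensemble import IsolationForest

def detect_abnormal_data(target_data, test_data, first_threshold=-0.55):
    """Fit an isolation forest on the target data plus historical test
    data, then score only the target data (claim 1)."""
    # Claim 1: both the target data and the test data serve as input data
    # for isolation forest modeling.
    x_fit = np.concatenate([target_data, test_data]).reshape(-1, 1)
    model = IsolationForest(n_estimators=100, random_state=0).fit(x_fit)
    # score_samples returns lower values for more anomalous points, so
    # target data scoring below the first threshold is judged abnormal.
    scores = model.score_samples(np.asarray(target_data, dtype=float).reshape(-1, 1))
    return scores, scores < first_threshold

# Hypothetical usage: target data from the current data table, test data
# extracted from the same field over a first preset time period (claim 2).
target = [10.2, 11.0, 9.8, 250.0, 10.5]
history = [10.1, 10.4, 9.9, 10.8, 10.3, 10.6]
scores, abnormal = detect_abnormal_data(target, history)
print(list(zip(target, scores.round(3), abnormal)))

Fitting on the union of the current target data and the historical test data mirrors claim 1's use of both as input data, while scoring only the target data and judging scores below the first threshold as abnormal mirrors the final two steps of the claim.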
CN202210718289.5A 2022-06-22 2022-06-22 Data detection method and device, computer equipment and storage medium Pending CN115203167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210718289.5A CN115203167A (en) 2022-06-22 2022-06-22 Data detection method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210718289.5A CN115203167A (en) 2022-06-22 2022-06-22 Data detection method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115203167A true CN115203167A (en) 2022-10-18

Family

ID=83578561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210718289.5A Pending CN115203167A (en) 2022-06-22 2022-06-22 Data detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115203167A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244106A (en) * 2023-03-22 2023-06-09 中航信移动科技有限公司 Data detection method of civil aviation data, storage medium and electronic equipment
CN116244106B (en) * 2023-03-22 2023-12-29 中航信移动科技有限公司 Data detection method of civil aviation data, storage medium and electronic equipment
CN116680337A (en) * 2023-07-10 2023-09-01 天津云检医学检验所有限公司 Visual processing method, system and storage medium for qPCR detection data
CN117148091A (en) * 2023-11-01 2023-12-01 杭州高坤电子科技有限公司 Semiconductor test method, system, terminal and storage medium
CN117148091B (en) * 2023-11-01 2024-02-06 杭州高坤电子科技有限公司 Semiconductor test method, system, terminal and storage medium

Similar Documents

Publication Publication Date Title
CN109598095B (en) Method and device for establishing scoring card model, computer equipment and storage medium
US11093519B2 (en) Artificial intelligence (AI) based automatic data remediation
CN115203167A (en) Data detection method and device, computer equipment and storage medium
US8762180B2 (en) Claims analytics engine
CN110569322A (en) Address information analysis method, device and system and data acquisition method
CN105718490A (en) Method and device for updating classifying model
CN106934254A (en) The analysis method and device of a kind of licensing of increasing income
CN110888625B (en) Method for controlling code quality based on demand change and project risk
CN112232944B (en) Method and device for creating scoring card and electronic equipment
CN111737244A (en) Data quality inspection method, device, computer system and storage medium
CN115394358A (en) Single cell sequencing gene expression data interpolation method and system based on deep learning
CN116414815A (en) Data quality detection method, device, computer equipment and storage medium
Juddoo et al. A qualitative assessment of machine learning support for detecting data completeness and accuracy issues to improve data analytics in big data for the healthcare industry
CN111858600A (en) Data wide table construction method, device, equipment and storage medium
CN116126843A (en) Data quality evaluation method and device, electronic equipment and storage medium
CN110378569A (en) Industrial relations chain building method, apparatus, equipment and storage medium
CN113177644A (en) Automatic modeling system based on word embedding and depth time sequence model
CN116739764A (en) Transaction risk detection method, device, equipment and medium based on machine learning
CN115759742A (en) Enterprise risk assessment method and device, computer equipment and storage medium
US20240152818A1 (en) Methods for mitigation of algorithmic bias discrimination, proxy discrimination and disparate impact
CN116910526A (en) Model training method, device, communication equipment and readable storage medium
KR102217092B1 (en) Method and apparatus for providing quality information of application
KR102110350B1 (en) Domain classifying device and method for non-standardized databases
CN114792007A (en) Code detection method, device, equipment, storage medium and computer program product
CN115577890A (en) Intelligent quality management method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination