CN117171157A - Clearing data acquisition and cleaning method based on data analysis - Google Patents

Clearing data acquisition and cleaning method based on data analysis

Info

Publication number
CN117171157A
Authority
CN
China
Prior art keywords
data
cleaning
preliminary
statistical
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311421148.8A
Other languages
Chinese (zh)
Other versions
CN117171157B (en)
Inventor
贾庆佳
张磊
冯伟
张志勇
刘永峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Off Site Market Clearing Center Co ltd
Original Assignee
Qingdao Off Site Market Clearing Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Off Site Market Clearing Center Co ltd filed Critical Qingdao Off Site Market Clearing Center Co ltd
Priority to CN202311421148.8A priority Critical patent/CN117171157B/en
Publication of CN117171157A publication Critical patent/CN117171157A/en
Application granted granted Critical
Publication of CN117171157B publication Critical patent/CN117171157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a clearing data acquisition and cleaning method based on data analysis, belonging to the field of data processing systems specially adapted for management purposes.

Description

Clearing data acquisition and cleaning method based on data analysis
Technical Field
The application belongs to the technical field of data processing systems specially adapted for management purposes, and in particular relates to a clearing data acquisition and cleaning method based on data analysis.
Background
In existing commodity sales processes, sales volume is usually counted every day. Because of errors in manual or mechanical data entry, the recorded figures often fail to match actual sales. The statistical data must then be cleared repeatedly to find the abnormal values and clean them. Manual cleaning is time-consuming and labor-intensive, while existing acquisition and cleaning methods can only roughly identify a small number of abnormal values that violate certain objective rules in the data, and are prone to both false positives and missed detections. As a result, clearing data is cleaned inefficiently. These problems exist in the prior art;
For example, Chinese patent application publication No. CN114238305A discloses a data cleaning method and a data cleaning apparatus. The method comprises: generating a constraint condition set for screening the data acquired by a data acquisition device; and, within each predefined time interval: receiving the data acquired by the data acquisition device in that interval and the monitoring information generated in that interval by the device monitoring the data acquisition device; determining from the monitoring information whether the acquisition in that interval was abnormal; if it was abnormal, setting the quality identifier of the data to low quality; if it was not, determining whether the data satisfies the constraint condition set; if any constraint condition in the set is not satisfied, setting the quality identifier to low quality; if every constraint condition in the set is satisfied, setting the quality identifier to medium or high quality according to whether the rate of change of the data relative to the data acquired in the previous predefined time interval exceeds a predefined rate-of-change threshold; and storing the data in association with its quality identifier;
As another example, Chinese patent application publication No. CN111427873A discloses a data cleaning method and system. In the method, each piece of data in first data is cleaned in turn based on a task mapping configuration to obtain corresponding result data. The first data comprises target data, sample data and standard result data for the sample data; the sample data corresponds to the data type of the target data, the standard result data conforms to the task mapping configuration, and the result data is either target result data or sample result data. When the result data is sample result data, it is matched against the corresponding standard result data, and quality inspection data is generated from the matching result. That application can inspect the attributes and content of cleaning results during the cleaning process and generate corresponding quality inspection data, so that staff can adjust the cleaning work promptly according to the quality inspection data, improving cleaning efficiency as well as cleaning quality.
The above patents still suffer from the problems described in the background: because of errors in manual or mechanical data entry, recorded figures often fail to match actual sales, so the statistical data must be cleared repeatedly to find and clean abnormal values; manual cleaning is time-consuming and labor-intensive; and existing acquisition and cleaning methods can only roughly identify a small number of abnormal values that violate certain objective rules, are prone to false positives and missed detections, and therefore clean clearing data inefficiently.
Disclosure of Invention
To address the shortcomings of the prior art, the application provides a clearing data acquisition and cleaning method based on data analysis. The method extracts the data to be cleared within a statistical period and classifies it by data type to form a clearing data table, while obtaining the statistical personnel's historical statistical data error records and the historical change curve of the statistical data. The data type to be cleaned is selected from the clearing data table, the influence factors that affect that data type are obtained from the historical change curve, and their influence coefficients are calculated. The data of that type is imported into a preliminary screening strategy to screen out preliminary cleaning data. The abnormal value of the preliminary cleaning data is calculated in combination with the influence factors of the data type, and its statistical error probability is calculated by importing it, together with the personnel's historical error records, into a preliminary cleaning data error value calculation strategy. The abnormal value and the statistical error probability are substituted into a cleaning strategy to calculate the data cleaning rate, which is compared with a set cleaning rate threshold: data whose cleaning rate is greater than or equal to the threshold needs cleaning, and data whose cleaning rate is smaller than the threshold does not. The data that needs cleaning is exported, the corresponding data in the clearing data table is cleaned, and the data is sent to a manager for verification. In this way, data cleaning efficiency is optimized and data cleaning accuracy is improved.
In order to achieve the above purpose, the present application provides the following technical solutions:
The clearing data acquisition and cleaning method based on data analysis comprises the following specific steps:
S1, extracting the data to be cleared within a statistical period, classifying it by data type to form a clearing data table, and obtaining the statistical personnel's historical statistical data error records and the historical change curve of the statistical data;
S2, selecting the data type to be cleaned from the clearing data table, obtaining the influence factors that affect that data type from the historical change curve of the statistical data, calculating the influence coefficient of each influence factor, and importing the data of the data type to be cleaned into a preliminary screening strategy to screen out preliminary cleaning data;
S3, exporting the preliminary cleaning data obtained by the preliminary screening and calculating the abnormal value of the preliminary cleaning data in combination with the influence factors of the data type;
S4, combining the exported preliminary cleaning data with the personnel's historical statistical data error records and importing them into a preliminary cleaning data error value calculation strategy to calculate the statistical error probability of the preliminary cleaning data;
S5, substituting the abnormal value and the statistical error probability of the preliminary cleaning data into a cleaning strategy to calculate the data cleaning rate;
S6, comparing the calculated data cleaning rate with a set cleaning rate threshold: if the data cleaning rate is greater than or equal to the threshold, the corresponding preliminary cleaning data is data that needs cleaning; if it is smaller than the threshold, the corresponding preliminary cleaning data does not need cleaning;
S7, exporting the data that needs cleaning, cleaning the corresponding data in the clearing data table, and sending the data that needs cleaning to a manager for verification.
Specifically, the step S1 includes the following specific steps:
S11, reading the storage that holds the clearing data and extracting the data to be cleared within the statistical period, including data that requires periodic clearing such as material storage data, material extraction data and material stock data;
S12, classifying the extracted data to be cleared by data type and arranging it in rows and columns to form the clearing data table, and obtaining the statistical personnel's historical statistical data error records, which include the type, frequency and error mode of the erroneous data; for example, if a statistician recorded material A as material B three times, the type of the erroneous data is material, the frequency is three, and the error mode is A recorded as B;
S13, obtaining the historical change curve of the statistical data, that is, the curve formed by the values of the statistical data collected at historical acquisition times.
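The records gathered in S11 to S13 could be held in simple structures such as the following. This is a minimal sketch only: the type names `ClearingRecord` and `ErrorRecord`, their fields, and the helper `select_kind` are hypothetical and do not appear in the application.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout of one row of the clearing data table. */
typedef struct {
    int day;            /* day index within the statistical period */
    const char *kind;   /* data type, e.g. "storage", "extraction", "stock" */
    double value;       /* value recorded for that day */
} ClearingRecord;

/* Hypothetical layout of one historical error record (S12). */
typedef struct {
    const char *kind;   /* type of the erroneous data, e.g. "material" */
    int frequency;      /* number of times the error occurred */
    const char *mode;   /* error mode, e.g. "A recorded as B" */
} ErrorRecord;

/* Copies the values of one data type, in table order, into out[].
   Returns the number of values copied (at most out_cap). */
size_t select_kind(const ClearingRecord *table, size_t n, const char *kind,
                   double *out, size_t out_cap) {
    size_t count = 0;
    for (size_t i = 0; i < n && count < out_cap; i++) {
        if (strcmp(table[i].kind, kind) == 0) {
            out[count++] = table[i].value;
        }
    }
    return count;
}
```

Selecting one data type from the table in this way corresponds to the first part of S2, where the data type to be cleaned is chosen.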
Specifically, the preliminary screening policy in S2 includes the following specific contents:
S21, an analyst selects the data type to be cleaned from the clearing data table and obtains the influence factors that affect that data type; the influence coefficient of each influence factor on the data type to be cleaned is calculated as: $\alpha_i=\frac{1}{n-1}\sum_{j=1}^{n-1}\frac{D_{j+1}-D_j}{x_{i,j+1}-x_{i,j}}$, where $n$ is the number of days in the statistical period, $\alpha_i$ is the influence coefficient of the i-th influence factor on the data type to be cleaned, $D_{j+1}$ is the historical cleaning data on day j+1, $D_j$ is the historical cleaning data on day j, $x_{i,j+1}$ is the value of the i-th influence factor on day j+1, and $x_{i,j}$ is the value of the i-th influence factor on day j;
S22, constructing a preliminary screening model that predicts the clearing data from the specific values of the influence factors and their influence coefficients. The screening formula of the model is not preserved in this text (it was an image in the source); its terms are: the predicted data value of the data type to be cleaned on day z+1; w, the number of influence factors; and the data value of the data type to be cleaned on statistical day z;
S23, substituting the data values of the data type to be cleaned into the preliminary screening model and predicting each day's data value from the previous day's data value to obtain a data prediction table; comparing the data prediction table with the clearing data table day by day; and selecting, as preliminary cleaning data, the data in the clearing data table whose difference value is greater than or equal to a set difference threshold. The difference value is calculated as the absolute value of the difference between the data in the data prediction table and the data for the same day in the clearing data table, divided by the data for that day in the clearing data table.
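The comparison in S23 can be sketched as follows. The function names `difference_value` and `screen_preliminary` and the `flagged` output array are illustrative assumptions; the predicted values are taken as given, since the screening formula itself is not reproduced here.

```c
#include <stddef.h>

/* Difference value from S23: |predicted - recorded| / recorded.
   Assumes recorded is non-zero; the text does not say how zero is handled. */
double difference_value(double predicted, double recorded) {
    double diff = predicted - recorded;
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff / recorded;
}

/* Marks flagged[i] = 1 for every day whose difference value is greater than
   or equal to the threshold (preliminary cleaning data), 0 otherwise.
   Returns the number of days flagged. */
size_t screen_preliminary(const double *predicted, const double *recorded,
                          size_t n, double threshold, int *flagged) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        flagged[i] = difference_value(predicted[i], recorded[i]) >= threshold;
        if (flagged[i]) {
            count++;
        }
    }
    return count;
}
```

With predictions 10, 20, 30 against recorded values 10, 25, 20 and a threshold of 0.3, the difference values are 0, 0.2 and 0.5, so only the third day is selected as preliminary cleaning data.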
Specifically, the step of deriving the preliminary cleaning data obtained by the preliminary screening in S3, and calculating the outlier of the preliminary cleaning data by combining the influence factors and the influence coefficients of the data types includes the following specific steps:
S31, extracting the difference values of the days corresponding to the preliminary cleaning data obtained in step S23 and setting them as a first difference value set, and extracting the difference values of the days corresponding to all data other than the preliminary cleaning data and setting them as a second difference value set;
S32, substituting the data of the first difference value set and the second difference value set into the abnormal value calculation formula to calculate the abnormal value of each preliminary cleaning data item. The abnormal value formula for the a-th preliminary cleaning data item is not preserved in this text (it was an image in the source); its terms are: the s-th difference value in the second difference value set; r, the number of difference values in the second difference value set; and the a-th difference value in the first difference value set.
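The split described in S31 can be sketched as below. The names `SplitSizes` and `split_difference_values` are hypothetical, and the sketch stops at building the two sets; it does not attempt the abnormal value calculation of S32.

```c
#include <stddef.h>

/* Sizes of the two sets produced by the split. */
typedef struct {
    size_t first_count;   /* difference values of preliminary cleaning data */
    size_t second_count;  /* difference values of all other data */
} SplitSizes;

/* Distributes diff[i] into first_set (flagged[i] != 0) or second_set
   (flagged[i] == 0), preserving order. Both output arrays must have
   room for n values. */
SplitSizes split_difference_values(const double *diff, const int *flagged,
                                   size_t n, double *first_set,
                                   double *second_set) {
    SplitSizes sizes = {0, 0};
    for (size_t i = 0; i < n; i++) {
        if (flagged[i]) {
            first_set[sizes.first_count++] = diff[i];
        } else {
            second_set[sizes.second_count++] = diff[i];
        }
    }
    return sizes;
}
```

The `second_count` field corresponds to r, the number of difference values in the second set.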
Specifically, the specific content of the preliminary cleaning data error value calculation strategy in S4 is as follows:
extracting the personnel's historical statistical data error records to obtain the statistical error probability of the k-th preliminary cleaning data item, calculated as: $P_k=\frac{\sum_{v} m_{k,v}}{M}$, where $m_{k,v}$ is the number of times an error has occurred for the v-th digit of the k-th preliminary cleaning data item, the sum runs over all digits of that item, and $M$ is the total number of errors that have occurred. For example, if the preliminary cleaning data item is 123 and, in past statistics, the personnel recorded the digit 1 incorrectly 2 times, the digit 2 incorrectly 8 times and the digit 3 incorrectly 2 times, out of 82 errors in total, then the statistical error probability of that item is (2 + 8 + 2)/82 = 12/82.
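The digit-by-digit calculation in this example can be sketched as follows. The function name `statistical_error_probability` and the `digit_errors` table are assumptions made for illustration. The sketch also assumes the data item is a non-negative integer, and it returns 0 when there is no error history, a case the text does not cover.

```c
/* digit_errors[d] holds the historical number of times the digit d
   (0 to 9) was recorded incorrectly; total_errors is the total number
   of historical errors. */
double statistical_error_probability(unsigned long value,
                                     const unsigned digit_errors[10],
                                     unsigned total_errors) {
    unsigned sum = 0;
    if (total_errors == 0) {
        return 0.0;  /* assumption: no error history gives probability 0 */
    }
    /* Add up the error counts of every decimal digit of the value. */
    do {
        sum += digit_errors[value % 10];
        value /= 10;
    } while (value != 0);
    return (double)sum / (double)total_errors;
}
```

For the value 123 with error counts of 2, 8 and 2 for the digits 1, 2 and 3 and 82 errors in total, the function returns 12/82, matching the example in the text.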
Specifically, the cleaning strategy in S5 includes the following specific matters:
extracting the abnormal value and the statistical error probability value of the preliminary cleaning data and importing them into the cleaning rate calculation formula to calculate the cleaning rate. The cleaning rate formula is not preserved in this text (it was an image in the source); its terms are: the cleaning rate of the k-th preliminary cleaning data item; the abnormal value of the k-th preliminary cleaning data item; a weight coefficient for the cleaning rate; and a weight coefficient for the statistical error probability value. A further relation involving these coefficients followed in the source but is likewise not preserved.
It should be noted that the values of the weight coefficients and of the cleaning rate threshold are obtained as follows: at least 500 groups of abnormal values and statistical error probability values of preliminary cleaning data are selected, the data that needs cleaning is identified manually, and the samples are imported into fitting software, which iterates until it reaches an optimal solution for the weight coefficients and the cleaning rate threshold.
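As an illustration of how the cleaning rate, the threshold comparison of S6 and the manually labelled samples fit together, consider the sketch below. It assumes the cleaning rate is a weighted sum of the abnormal value and the statistical error probability; that form is an assumption made here, not the application's formula. The function names and parameters are likewise hypothetical.

```c
#include <stddef.h>

/* ASSUMPTION: weighted-sum form, chosen for illustration only. */
double cleaning_rate(double abnormal_value, double error_probability,
                     double w_abnormal, double w_error) {
    return w_abnormal * abnormal_value + w_error * error_probability;
}

/* S6: a rate greater than or equal to the threshold means the
   preliminary cleaning data needs cleaning. */
int needs_cleaning(double rate, double threshold) {
    return rate >= threshold;
}

/* Counts how many manually labelled samples (label[i] = 1 if the item
   was judged to need cleaning, 0 otherwise) a candidate set of weights
   and threshold classifies the same way. A fitting procedure could
   compare candidates by this count. */
size_t count_agreements(const double *abnormal_value,
                        const double *error_probability, const int *label,
                        size_t n, double w_abnormal, double w_error,
                        double threshold) {
    size_t agree = 0;
    for (size_t i = 0; i < n; i++) {
        double rate = cleaning_rate(abnormal_value[i], error_probability[i],
                                    w_abnormal, w_error);
        if (needs_cleaning(rate, threshold) == (label[i] != 0)) {
            agree++;
        }
    }
    return agree;
}
```

A search over candidate weights and thresholds that maximizes `count_agreements` on the 500 or more labelled groups would be one simple stand-in for the fitting software mentioned above.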
An electronic device, comprising: a processor and a memory, wherein the memory stores a computer program for the processor to call;
the processor executes the above-described data analysis-based clearing data acquisition cleaning method by calling a computer program stored in the memory.
A computer readable storage medium storing instructions that when executed on a computer cause the computer to perform a data analysis based clearing data acquisition cleaning method as described above.
Compared with the prior art, the application has the beneficial effects that:
The method extracts the data to be cleared within a statistical period and classifies it by data type to form a clearing data table, while obtaining the statistical personnel's historical statistical data error records and the historical change curve of the statistical data. The data type to be cleaned is selected from the clearing data table, the influence factors that affect that data type are obtained from the historical change curve, and their influence coefficients are calculated. The data of that type is imported into a preliminary screening strategy to screen out preliminary cleaning data. The abnormal value of the preliminary cleaning data is calculated in combination with the influence factors of the data type, and its statistical error probability is calculated by importing it, together with the personnel's historical error records, into a preliminary cleaning data error value calculation strategy. The abnormal value and the statistical error probability are substituted into a cleaning strategy to calculate the data cleaning rate, which is compared with a set cleaning rate threshold: data whose cleaning rate is greater than or equal to the threshold needs cleaning, and data whose cleaning rate is smaller than the threshold does not. The data that needs cleaning is exported, the corresponding data in the clearing data table is cleaned, and the data is sent to a manager for verification. In this way, data cleaning efficiency is optimized and data cleaning accuracy is improved.
Drawings
FIG. 1 is a schematic flow chart of a clearing data collection and cleaning method based on data analysis;
FIG. 2 is a schematic diagram of the sub-steps of step S1 of the clearing data acquisition and cleaning method based on data analysis according to the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments.
Example 1
Referring to fig. 1, an embodiment of the present application is provided: the clearing data acquisition and cleaning method based on data analysis comprises the following specific steps:
S1, extracting the data to be cleared within a statistical period, classifying it by data type to form a clearing data table, and obtaining the statistical personnel's historical statistical data error records and the historical change curve of the statistical data;
in this embodiment, S1 includes the following specific steps:
S11, reading the storage that holds the clearing data and extracting the data to be cleared within the statistical period, including data that requires periodic clearing such as material storage data, material extraction data and material stock data;
S12, classifying the extracted data to be cleared by data type and arranging it in rows and columns to form the clearing data table, and obtaining the statistical personnel's historical statistical data error records, which include the type, frequency and error mode of the erroneous data; for example, if a statistician recorded material A as material B three times, the type of the erroneous data is material, the frequency is three, and the error mode is A recorded as B;
S13, obtaining the historical change curve of the statistical data, that is, the curve formed by the values of the statistical data collected at historical acquisition times;
S2, selecting the data type to be cleaned from the clearing data table, obtaining the influence factors that affect that data type from the historical change curve of the statistical data, calculating the influence coefficient of each influence factor, and importing the data of the data type to be cleaned into a preliminary screening strategy to screen out preliminary cleaning data;
in this embodiment, the preliminary screening strategy in S2 includes the following specific contents:
S21, an analyst selects the data type to be cleaned from the clearing data table and obtains the influence factors that affect that data type, which include environmental factors such as temperature, precipitation, temperature difference change and humidity; the influence coefficient of each influence factor on the data type to be cleaned is calculated as: $\alpha_i=\frac{1}{n-1}\sum_{j=1}^{n-1}\frac{D_{j+1}-D_j}{x_{i,j+1}-x_{i,j}}$, where $n$ is the number of days in the statistical period, $\alpha_i$ is the influence coefficient of the i-th influence factor on the data type to be cleaned, $D_{j+1}$ is the historical cleaning data on day j+1, $D_j$ is the historical cleaning data on day j, $x_{i,j+1}$ is the value of the i-th influence factor on day j+1, and $x_{i,j}$ is the value of the i-th influence factor on day j;
the following is C language code for calculating the influence coefficient according to the given formula:
#include <stdio.h>

/* Averages the ratio of the day-to-day change in the data to the
   day-to-day change in the factor over all consecutive day pairs.
   Assumes the factor changes between consecutive days, so that the
   denominator is never zero. */
float calculateImpactFactor(const float *data, const float *environmentalFactor, int days) {
    float impactFactor = 0.0f;
    for (int i = 0; i < days - 1; i++) {
        float deltaT = data[i + 1] - data[i];
        float deltaEnvironmental = environmentalFactor[i + 1] - environmentalFactor[i];
        impactFactor += deltaT / deltaEnvironmental;
    }
    impactFactor /= (days - 1);
    return impactFactor;
}

int main(void) {
    float data[] = {3.2f, 4.5f, 5.1f, 4.8f, 6.3f};                     // historical cleaning data
    float environmentalFactor[] = {20.0f, 25.0f, 22.0f, 23.5f, 24.0f}; // environmental factor data
    int days = sizeof(data) / sizeof(data[0]);                          // number of statistical days
    float impactFactor = calculateImpactFactor(data, environmentalFactor, days);
    printf("Influence coefficient: %.2f\n", impactFactor);
    return 0;
}
In the code, the 'data' array holds the historical cleaning data, the 'environmentalFactor' array holds the environmental factor data, and 'days' is the number of statistical days. The 'calculateImpactFactor' function calculates the influence coefficient and returns it as a floating-point number; the 'main' function passes the data to 'calculateImpactFactor' and prints the result;
note that this is only example code and must be modified and verified as appropriate for the specific situation before actual use;
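As a check on the sample arrays in the code, the averaged ratio can be worked out by hand. Here $D_j$ (historical cleaning data on day j) and $x_j$ (environmental factor value on day j) are shorthand used only for this arithmetic.

```latex
\begin{aligned}
\frac{D_2-D_1}{x_2-x_1} &= \frac{4.5-3.2}{25.0-20.0} = 0.26 \\
\frac{D_3-D_2}{x_3-x_2} &= \frac{5.1-4.5}{22.0-25.0} = -0.20 \\
\frac{D_4-D_3}{x_4-x_3} &= \frac{4.8-5.1}{23.5-22.0} = -0.20 \\
\frac{D_5-D_4}{x_5-x_4} &= \frac{6.3-4.8}{24.0-23.5} = 3.00 \\
\frac{1}{4}\left(0.26-0.20-0.20+3.00\right) &= \frac{2.86}{4} = 0.715
\end{aligned}
```

The last day pair dominates the average because the factor changes by only 0.5 there, which shows how sensitive the ratio is to small denominators.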
S22, constructing a preliminary screening model that predicts the clearing data from the specific values of the influence factors and their influence coefficients. The screening formula of the model is not preserved in this text (it was an image in the source); its terms are: the predicted data value of the data type to be cleaned on day z+1; w, the number of influence factors; and the data value of the data type to be cleaned on statistical day z;
S23, substituting the data values of the data type to be cleaned into the preliminary screening model and predicting each day's data value from the previous day's data value to obtain a data prediction table; comparing the data prediction table with the clearing data table day by day; and selecting, as preliminary cleaning data, the data in the clearing data table whose difference value is greater than or equal to a set difference threshold. The difference value is calculated as the absolute value of the difference between the data in the data prediction table and the data for the same day in the clearing data table, divided by the data for that day in the clearing data table;
S3, exporting the preliminary cleaning data obtained by the preliminary screening and calculating the abnormal value of the preliminary cleaning data in combination with the influence factors of the data type;
in this embodiment, the specific steps of deriving the preliminary cleaning data obtained by the preliminary screening in S3 and calculating the outlier of the preliminary cleaning data by combining the influence factors and the influence coefficients of the data types are as follows:
S31, extracting the difference values of the days corresponding to the preliminary cleaning data obtained in step S23 and setting them as a first difference value set, and extracting the difference values of the days corresponding to all data other than the preliminary cleaning data and setting them as a second difference value set;
S32, substituting the data of the first difference value set and the second difference value set into the abnormal value calculation formula to calculate the abnormal value of each preliminary cleaning data item. The abnormal value formula for the a-th preliminary cleaning data item is not preserved in this text (it was an image in the source); its terms are: the s-th difference value in the second difference value set; r, the number of difference values in the second difference value set; and the a-th difference value in the first difference value set;
S4, combining the exported preliminary cleaning data with the personnel's historical statistical data error records and importing them into a preliminary cleaning data error value calculation strategy to calculate the statistical error probability of the preliminary cleaning data;
in this embodiment, the specific content of the preliminary cleaning data error value calculation policy in S4 is:
extracting the personnel's historical statistical data error records to obtain the statistical error probability of the k-th preliminary cleaning data item, calculated as: $P_k=\frac{\sum_{v} m_{k,v}}{M}$, where $m_{k,v}$ is the number of times an error has occurred for the v-th digit of the k-th preliminary cleaning data item, the sum runs over all digits of that item, and $M$ is the total number of errors that have occurred. For example, if the preliminary cleaning data item is 123 and, in past statistics, the personnel recorded the digit 1 incorrectly 2 times, the digit 2 incorrectly 8 times and the digit 3 incorrectly 2 times, out of 82 errors in total, then the statistical error probability of that item is (2 + 8 + 2)/82 = 12/82;
S5, substituting the abnormal value and the statistical error probability of the preliminary cleaning data into a cleaning strategy to calculate the data cleaning rate;
S6, comparing the calculated data cleaning rate with a set cleaning rate threshold: if the data cleaning rate is greater than or equal to the threshold, the corresponding preliminary cleaning data is data that needs cleaning; if it is smaller than the threshold, the corresponding preliminary cleaning data does not need cleaning;
in this embodiment, the cleaning strategy in S5 includes the following specific matters:
extracting the abnormal value and the statistical error probability value of the preliminary cleaning data and importing them into the cleaning rate calculation formula to calculate the cleaning rate. The cleaning rate formula is not preserved in this text (it was an image in the source); its terms are: the cleaning rate of the k-th preliminary cleaning data item; the abnormal value of the k-th preliminary cleaning data item; a weight coefficient for the cleaning rate; and a weight coefficient for the statistical error probability value. A further relation involving these coefficients followed in the source but is likewise not preserved.
It should be noted that the values of the weight coefficients and of the cleaning rate threshold are obtained as follows: at least 500 groups of abnormal values and statistical error probability values of preliminary cleaning data are selected, the data that needs cleaning is identified manually, and the samples are imported into fitting software, which iterates until it reaches an optimal solution for the weight coefficients and the cleaning rate threshold;
S7, exporting the data that needs cleaning, cleaning the corresponding data in the clearing data table, and sending the data that needs cleaning to a manager for verification;
In summary, this embodiment extracts the data to be cleared within a statistical period and classifies it by data type to form a clearing data table, while obtaining the statistical personnel's historical statistical data error records and the historical change curve of the statistical data. The data type to be cleaned is selected, its influence factors and their influence coefficients are obtained, and the data of that type is passed through the preliminary screening strategy to obtain preliminary cleaning data. The abnormal value and the statistical error probability of the preliminary cleaning data are calculated and substituted into the cleaning strategy to obtain the data cleaning rate, which is compared with the set cleaning rate threshold: data whose cleaning rate is greater than or equal to the threshold needs cleaning, and data whose cleaning rate is smaller than the threshold does not. The data that needs cleaning is exported, the corresponding data in the clearing data table is cleaned, and the data is sent to a manager for verification. In this way, data cleaning efficiency is optimized and data cleaning accuracy is improved.
Example 2
The present embodiment provides an electronic device including: a processor and a memory, wherein the memory stores a computer program for the processor to call;
the processor performs the above-described data analysis-based clearing data acquisition cleaning method by calling a computer program stored in the memory.
The electronic device may vary greatly in configuration or performance and can include one or more processors (central processing units, CPUs) and one or more memories, where the memory stores at least one computer program that is loaded and executed by the processor to implement the data analysis-based clearing data acquisition cleaning method provided by the above method embodiment. The electronic device can also include other components for implementing the functions of the device; for example, it can have wired or wireless network interfaces and input/output interfaces for inputting and outputting data. These are not described further in this embodiment.
Example 3
The present embodiment proposes a computer-readable storage medium having stored thereon an erasable computer program;
the computer program, when run on a computer device, causes the computer device to perform the above-described data analysis-based clearing data acquisition cleaning method.
For example, the computer-readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, etc.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It should be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another over a wired and/or wireless network. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that contains one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the partitioning of units is merely one way of partitioning, and there may be additional ways of partitioning in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the application disclosed above are intended only to assist in the explanation of the application. The preferred embodiments are not intended to be exhaustive or to limit the application to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and the full scope and equivalents thereof.

Claims (9)

1. A clearing data acquisition and cleaning method based on data analysis, characterized by comprising the following specific steps:
S1, extracting the data to be cleared within a statistical period, classifying the data to be cleared by data type to form a clearing data table, and simultaneously acquiring the historical statistical data error records of the statisticians and the historical change curve of the statistical data;
S2, selecting the data type to be cleaned from the clearing data table, acquiring the influence factors affecting that data type from the historical change curve of the statistical data, calculating the influence coefficients of the influence factors, and importing the data of the data type to be cleaned into a preliminary screening strategy for preliminary screening of cleaning data;
S3, exporting the preliminary cleaning data obtained by the preliminary screening, and calculating the abnormal value of the preliminary cleaning data in combination with the influence factors of the data type;
S4, exporting the preliminary cleaning data obtained by the preliminary screening, combining it with the historical statistical data error records of the personnel, and importing it into a preliminary cleaning data error value calculation strategy to calculate the statistical error probability of the preliminary cleaning data;
S5, substituting the abnormal value of the preliminary cleaning data and the statistical error probability of the preliminary cleaning data into a cleaning strategy to calculate the data cleaning rate;
S6, comparing the calculated data cleaning rate with a set cleaning rate threshold, wherein if the data cleaning rate is greater than or equal to the cleaning rate threshold, the corresponding preliminary cleaning data is data requiring cleaning, and if the data cleaning rate is smaller than the cleaning rate threshold, the corresponding preliminary cleaning data is data not requiring cleaning;
S7, exporting the obtained data requiring cleaning, cleaning the corresponding data in the clearing data table, and simultaneously transmitting the data requiring cleaning to a manager for verification of the cleaning data.
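The threshold comparison in S6 can be illustrated with a minimal Python sketch. It assumes the data cleaning rate of each preliminary cleaning record has already been computed by the cleaning strategy of S5; the function name, the record identifiers, and the numeric values are hypothetical and are not taken from the application.

```python
def split_by_cleaning_rate(cleaning_rates, threshold):
    """Apply the S6 rule: rate >= threshold means the record requires cleaning.

    cleaning_rates maps a (hypothetical) record identifier to its cleaning rate.
    """
    requires_cleaning = {}
    no_cleaning_needed = {}
    for record_id, rate in cleaning_rates.items():
        if rate >= threshold:
            requires_cleaning[record_id] = rate
        else:
            no_cleaning_needed[record_id] = rate
    return requires_cleaning, no_cleaning_needed


# Illustrative values only.
rates = {"rec-1": 0.75, "rec-2": 0.25, "rec-3": 0.5}
to_clean, to_keep = split_by_cleaning_rate(rates, threshold=0.5)
```

A record whose rate equals the threshold is treated as requiring cleaning, matching the "greater than or equal to" wording of S6.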
2. The data analysis-based clearing data collection and cleaning method according to claim 1, wherein S1 comprises the following specific steps:
S11, reading the memory in which the clearing data is stored, and extracting the data to be cleared within the statistical period;
S12, classifying the extracted data to be cleared by data type, arranging the classified data in rows and columns to form the clearing data table, and simultaneously acquiring the historical statistical data error records of the statisticians;
S13, acquiring the historical change curve of the statistical data, wherein the historical change curve is a data change curve formed by the acquired values of the statistical data at the historical acquisition times.
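As a rough illustration of S12 and S13, the sketch below groups raw records by data type into a row-and-column table and reads one data type back as a time-ordered curve. The record fields (`type`, `day`, `value`) and the sample data types are assumptions made for the example; the application does not specify a storage layout.

```python
from collections import defaultdict


def build_clearing_table(records):
    """Group records by data type, then by day (rows = types, columns = days)."""
    table = defaultdict(dict)
    for record in records:
        table[record["type"]][record["day"]] = record["value"]
    return {data_type: dict(by_day) for data_type, by_day in table.items()}


def history_curve(table, data_type):
    """Return the (day, value) points of one data type in chronological order."""
    return sorted(table[data_type].items())


# Hypothetical records for illustration.
records = [
    {"type": "settlement_amount", "day": 2, "value": 1050.0},
    {"type": "settlement_amount", "day": 1, "value": 1000.0},
    {"type": "trade_count", "day": 1, "value": 37},
]
clearing_table = build_clearing_table(records)
curve = history_curve(clearing_table, "settlement_amount")
```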
3. The data analysis-based clearing data collection and cleaning method according to claim 2, wherein the preliminary screening strategy in S2 includes the following specific contents:
S21, an analyst selects the data type to be cleaned from the clearing data table, acquires the influence factors affecting that data type, and calculates the influence coefficient of each influence factor on the data type to be cleaned, the calculation formula being as follows: [formula not reproduced in the source text], where n is the number of days in the statistical period, and the remaining symbols of the formula denote, in order of appearance: the influence coefficient of the i-th influence factor on the data type to be cleaned; the historical cleaning data on day j+1; the historical cleaning data on day j; the value of the i-th influence factor on day j+1; and the value of the i-th influence factor on day j;
S22, constructing a preliminary screening model for preliminarily predicting the clearing data based on the specific values of the influence factors and the influence coefficients of the influence factors, the screening formula of the preliminary screening model being as follows: [formula not reproduced in the source text], where w is the number of influence factors, and the remaining symbols denote, in order of appearance: the predicted data value of the data type to be cleaned on day z+1; and the recorded data value of the data type to be cleaned on day z.
4. The data analysis-based clearing data collection and cleaning method according to claim 3, wherein the preliminary screening strategy in S2 further comprises the following specific contents:
S23, substituting the data values of the data type to be cleaned into the preliminary screening model and predicting each following day's data value from the preceding day's data value to obtain a data prediction table; comparing the data prediction table with the clearing data table for the corresponding days; and selecting, as preliminary cleaning data, the data in the clearing data table whose phase difference value is greater than or equal to the set phase difference threshold, wherein the phase difference value is calculated as the absolute value of the difference between the data in the data prediction table and the data in the clearing data table for the corresponding day, divided by the data in the clearing data table for that day.
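The phase difference value defined in S23 is a relative deviation, |predicted − recorded| / recorded. The following Python sketch computes it per day and selects the days that meet the threshold. It assumes the predicted values are already available (the screening formula of S22 is not reproduced here) and that recorded values are positive; the function names and sample numbers are hypothetical.

```python
def phase_difference(predicted, recorded):
    """|predicted - recorded| / recorded, as described in S23."""
    if recorded == 0:
        raise ValueError("recorded value must be non-zero")
    return abs(predicted - recorded) / recorded


def select_preliminary(predictions, recorded_values, threshold):
    """Return {day: phase difference} for days whose difference >= threshold."""
    selected = {}
    for day, predicted in predictions.items():
        diff = phase_difference(predicted, recorded_values[day])
        if diff >= threshold:
            selected[day] = diff
    return selected


# Illustrative values only: day -> value.
predictions = {1: 100.0, 2: 125.0, 3: 120.0}
recorded_values = {1: 100.0, 2: 100.0, 3: 150.0}
preliminary = select_preliminary(predictions, recorded_values, threshold=0.25)
```

Here day 2 deviates by exactly 0.25 and is selected, while day 3 deviates by 0.2 and is not.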
5. The data analysis-based clearing data collection and cleaning method according to claim 4, wherein exporting the preliminary cleaning data obtained by the preliminary screening in S3 and calculating the outlier of the preliminary cleaning data in combination with the influence factors and influence coefficients of the data type comprises the following specific steps:
S31, extracting the phase difference values of the days corresponding to the preliminary cleaning data obtained in step S23 and setting them as a first phase difference value set, and extracting the phase difference values of the days corresponding to the data other than the preliminary cleaning data and setting them as a second phase difference value set;
S32, substituting the data in the first phase difference value set and the data in the second phase difference value set into an outlier calculation formula to calculate the outlier of the preliminary cleaning data, the outlier calculation formula of the a-th preliminary cleaning data being: [formula not reproduced in the source text], where r is the number of phase difference values in the second phase difference value set, and the remaining symbols denote, in order of appearance: the s-th phase difference value in the second phase difference value set; and the a-th phase difference value in the first phase difference value set.
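The partition in S31 can be sketched as follows. Only the split into the two phase difference value sets is shown; the outlier formula of S32 is not reproduced in this text and is therefore not implemented. The function name and the sample values are hypothetical.

```python
def partition_differences(differences, preliminary_days):
    """Split per-day phase difference values into the two sets of S31.

    first_set:  values for days flagged as preliminary cleaning data.
    second_set: values for all remaining days.
    Both lists are ordered by day.
    """
    flagged = set(preliminary_days)
    first_set = [v for day, v in sorted(differences.items()) if day in flagged]
    second_set = [v for day, v in sorted(differences.items()) if day not in flagged]
    return first_set, second_set


# Illustrative values only: day -> phase difference value.
differences = {1: 0.0, 2: 0.25, 3: 0.2}
first_set, second_set = partition_differences(differences, preliminary_days=[2])
```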
6. The data analysis-based clearing data collection and cleaning method according to claim 5, wherein the preliminary cleaning data error value calculation strategy in S4 specifically includes:
extracting the historical statistical data error records of the personnel to obtain the statistical error probability of the k-th preliminary cleaning data, the statistical error probability calculation formula being as follows: [formula not reproduced in the source text], where the symbols denote, in order of appearance: the number of times a data error occurred at the v-th bit of the k-th preliminary cleaning data; and the total number of errors that occurred.
7. The data analysis-based clearing data collection and cleaning method according to claim 6, wherein the cleaning strategy comprises the following specific steps:
extracting the outlier corresponding to the preliminary cleaning data and the statistical error probability value of the preliminary cleaning data, and importing them into a cleaning rate calculation formula to calculate the cleaning rate, the cleaning rate calculation formula being as follows: [formula not reproduced in the source text], where the symbols denote, in order of appearance: the cleaning rate of the k-th preliminary cleaning data; the outlier of the k-th preliminary cleaning data; the cleaning rate proportion coefficient; and the statistical error probability value proportion coefficient,
8. A human-machine interaction device, comprising: a processor and a memory, wherein the memory stores a computer program for the processor to call;
characterized in that the processor executes the data analysis-based clearing data acquisition and cleaning method according to any one of claims 1 to 7 by calling the computer program stored in the memory.
9. A computer readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the data analysis based clearing data acquisition cleaning method of any one of claims 1 to 7.
CN202311421148.8A 2023-10-31 2023-10-31 Clearing data acquisition and cleaning method based on data analysis Active CN117171157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311421148.8A CN117171157B (en) 2023-10-31 2023-10-31 Clearing data acquisition and cleaning method based on data analysis

Publications (2)

Publication Number Publication Date
CN117171157A (en) 2023-12-05
CN117171157B (en) 2024-01-16

Family

ID=88941629

Country Status (1)

Country Link
CN (1) CN117171157B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant