CN111400288A - Data quality inspection method and system - Google Patents

Info

Publication number
CN111400288A
Authority
CN
China
Prior art keywords
data
quality
inspection
component
quality inspection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910000892.8A
Other languages
Chinese (zh)
Inventor
吴嘉
董晓荔
陈曦
李宝磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN201910000892.8A priority Critical patent/CN111400288A/en
Publication of CN111400288A publication Critical patent/CN111400288A/en
Pending legal-status Critical Current

Landscapes

  • General Factory Administration (AREA)

Abstract

The invention provides a data quality inspection method and a data quality inspection system, and belongs to the technical field of data management and control. In the data quality inspection system, the monitoring component is used for calling the data acquisition component and the quality inspection component; the data acquisition component is used for acquiring the inspection objects in batch or in real time and storing the inspection objects into the data storage component; the quality inspection component is used for acquiring quality rules from the quality rule base, acquiring inspection objects from the data storage component, and performing quality inspection on the inspection objects according to the quality rules, wherein the types of the quality rules are matched with the data types of the inspection objects; the monitoring component is also used for recording the inspection result and the abnormal data of the quality inspection component and returning the inquiry result according to the inquiry command input by the user. By the technical scheme of the invention, the real-time quality monitoring can be automatically carried out on mass data.

Description

Data quality inspection method and system
Technical Field
The present invention relates to the field of data management and control technologies, and in particular, to a data quality inspection method and system.
Background
Data quality management refers to a series of management activities, such as identification, measurement, monitoring and early warning, directed at the data quality problems that may arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application and retirement); data quality is further improved by raising the management level of the organization. High-quality data, that is, data that is accurate, consistent and available in a timely manner, is indispensable to the management of today's organizations. Organizations must work to identify the data relevant to their decision making, develop business strategies and practices that ensure data accuracy and completeness, and facilitate enterprise-wide data sharing.
The primary task in improving data quality is to define a set of standardized data specifications that fix the definition, statistical scope, format, value, unit and so on of each specific data item, thereby forming specific quality requirements for that item. This set of specifications then serves as the yardstick for measuring and improving data quality: key data items are checked, preventively or by monitoring, in each link of data acquisition, processing and application, so that the data quality problems of each system are exposed. For example, all business rules and the field-level data integrity constraints in E-R diagrams and other documents are fully analyzed and documented.
In a traditional data warehouse environment based on technologies such as massively parallel processing (MPP) and high-performance all-in-one machines, all service data is handled in batches: collection, processing and summarization are performed periodically (generally once a day). Data quality inspection is therefore also generally performed in daily batches, and key data quality reports for each system are generated periodically to track the state of system data quality.
In the big data era, data is an asset, characterized by large volume, many data types, strong real-time requirements and high value. Taking the unified DPI data collected by Deep Packet Inspection (DPI) technology as an example, one day of XDR data from 10,000 evolved base stations (eNBs) requires about 8.5 TB of storage, and the service characteristics of the data require real-time storage and processing around the clock. Inspecting XDR data quality with a conventional batch processing scheme clearly cannot meet the timeliness requirement.
The processing manner of data quality management in a conventional data warehouse is shown in FIG. 1. In this mode, service data is collected in batches. For example, before 24:00 each day the system collects that day's incremental or full business data, loads it into a relational database, parses the business rules into quality rules, and performs the quality checks one by one. Under this scheme the data quality check results are delayed, and as the data volume and the complexity of the quality check rules grow, the processing time required by the checking jobs increases rapidly; the previous day's data may even remain unprocessed when the next day's data arrives.
Therefore, in a big data environment, the data volume has grown greatly, data types have diversified, unstructured data and internal and external data are mixed together, and real-time requirements are strong. The traditional data quality management mode struggles to meet management requirements such as data complexity, efficient quality checking and innovative data operations.
Disclosure of Invention
The invention aims to provide a data quality inspection method and a data quality inspection system, which can automatically perform real-time quality monitoring on mass data.
To solve the above technical problem, embodiments of the present invention provide the following technical solutions:
the embodiment of the invention provides a data quality inspection system, which comprises a monitoring component, a data acquisition component, a quality inspection component, a quality rule base and a data storage component, wherein the data storage component adopts a distributed file system (HDFS);
the monitoring component is used for calling the data acquisition component and the quality inspection component;
the data acquisition component is used for acquiring the inspection objects in batch or in real time and storing the inspection objects into the data storage component;
the quality inspection component is used for acquiring quality rules from the quality rule base, acquiring inspection objects from the data storage component, and performing quality inspection on the inspection objects according to the quality rules, wherein the types of the quality rules are matched with the data types of the inspection objects;
the monitoring component is also used for recording the inspection result and the abnormal data of the quality inspection component and returning the inquiry result according to the inquiry command input by the user.
Further, the system also comprises a data cleaning component,
the monitoring component is also used for calling the data cleaning component to clean the historical data.
Further, the quality inspection component further comprises a quality rule engine for identifying and parsing the acquired quality rules.
Further, the data collection component is specifically configured to mark the data type of the data as structured data, unstructured data, or streaming data and store the marked data according to the source of the data and format information of the data after the collection of the single data is completed.
Further, the quality rule base is specifically configured to obtain a technical rule or a service rule in a service system or a management platform, obtain change information of the rule in real time, and convert the technical rule or the service rule into the quality rule by parsing, where a check type of the quality rule includes at least one of: integrity, normalization, consistency, accuracy, uniqueness, relevance.
Further, the quality inspection component is specifically used for selecting different processing modes according to the annotated data type and the check type of the quality rule, wherein the processing modes comprise text processing, Spark SQL processing and Spark Streaming processing.
Further, the text processing is used for quality checks of integrity, accuracy and uniqueness;
the Spark SQL processing is used for quality checks of consistency, accuracy and relevance;
the Spark Streaming processing is used for quality checks of service data whose timeliness requirement is greater than a first threshold, or whose throughput is greater than a second threshold.
Further, the collection of the business rules is acquired through a Kafka message service queue.
The embodiment of the invention also provides a data quality inspection method, which is applied to the data quality inspection system and comprises the following steps:
acquiring inspection objects in batch or in real time, and storing the inspection objects;
acquiring a quality rule from a quality rule base, and performing quality inspection on the inspection object according to the quality rule, wherein the type of the quality rule is matched with the data type of the inspection object;
and recording the checking result and the abnormal data, and returning the query result according to the query instruction input by the user.
Further, the method also includes:
cleaning the historical data.
Further, the method also includes:
summarizing the inspection results in text mode by using the map or reduce operations of Spark, or importing the inspection result details into a database and summarizing them (SQL) or aggregating them (NoSQL).
Further, the method also includes:
when the data volume of the abnormal data is greater than or equal to a third threshold and all abnormal data need to be queried, storing the abnormal data on the data storage component; and when the data volume of the abnormal data is smaller than the third threshold or only sample abnormal data need to be queried, writing the abnormal data into a database server of the monitoring component.
An embodiment of the present invention further provides a data quality inspection apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the data quality checking method as described above.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the data quality inspection method described above.
The embodiment of the invention has the following beneficial effects:
in the scheme, an efficient, universal and automatic data quality inspection system is constructed, and a scheme for automatically matching quality inspection processing is provided according to the data type and the quality rule type of an inspection object, so that the universality and expandability of the system are improved, and the standard, universal and automatic data quality inspection system is realized; according to the technical scheme, the timeliness of the data quality management system is greatly improved, TB-level large-scale, high-performance and high-timeliness data quality inspection in a large data platform environment can be borne, and the availability and timeliness of the data quality management system are improved.
Drawings
FIG. 1 is a schematic diagram of a data quality management process in a conventional data warehouse;
FIG. 2 is an interaction diagram of components of a data quality inspection system according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a data quality inspection method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a processing method of a data quality inspection system according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a technical architecture of a quality rules engine according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved by the embodiments of the present invention clearer, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The names and abbreviations of the terms related to the present invention may be changed correspondingly, and the technical solution of the present invention is still applicable when the abbreviations are changed.
The embodiment of the invention provides a data quality inspection method and a data quality inspection system, which can automatically perform real-time quality monitoring on mass data.
The embodiment of the invention provides a data quality inspection system, which comprises a monitoring component, a data acquisition component, a quality inspection component, a quality rule base and a data storage component, wherein the data storage component adopts a distributed file system (HDFS);
the monitoring component is used for calling the data acquisition component and the quality inspection component;
the data acquisition component is used for acquiring the inspection objects in batch or in real time and storing the inspection objects into the data storage component;
the quality inspection component is used for acquiring quality rules from the quality rule base, acquiring inspection objects from the data storage component, and performing quality inspection on the inspection objects according to the quality rules, wherein the types of the quality rules are matched with the data types of the inspection objects;
the monitoring component is also used for recording the inspection result and the abnormal data of the quality inspection component and returning the inquiry result according to the inquiry command input by the user.
In the embodiment, an efficient, universal and automatic data quality inspection system is constructed, and a scheme for automatically matching quality inspection processing is provided according to the data type and the quality rule type of an inspection object, so that the universality and expandability of the system are improved, and the standard, universal and automatic data quality inspection system is realized; according to the technical scheme, the timeliness of the data quality management system is greatly improved, TB-level large-scale, high-performance and high-timeliness data quality inspection in a large data platform environment can be borne, and the availability and timeliness of the data quality management system are improved.
Further, the monitoring component may be specifically divided into a scheduling component, a master controller, and a monitoring platform, as shown in fig. 2, in a specific embodiment, the data quality inspection system includes a scheduling component 11, a master controller 12, a data acquisition component 13, a quality inspection component 14, a quality rule base 17, a data storage component 16, and a monitoring platform 18, where the data storage component 16 employs a distributed file system HDFS;
the scheduling component 11 is configured to send a scheduling instruction to the master controller according to a preset time or a preset condition, and trigger to start the master controller 12;
the master controller 12 is used for calling the quality inspection component 14 to perform quality inspection; calling the data acquisition component 13, acquiring the inspection objects in batch or in real time, and storing the inspection objects in the data storage component 16;
the quality inspection component 14 is configured to obtain a quality rule from the quality rule base 17, obtain an inspection object from the data storage component 16, and perform quality inspection on the inspection object according to the quality rule, where the type of the quality rule matches the data type of the inspection object;
the monitoring platform 18 is configured to record the inspection result and the abnormal data of the quality inspection component 14, and return a query result according to a query instruction input by a user.
Further, as shown in FIG. 2, the data quality inspection system further includes a data cleaning component 15,
the master controller 12 is also used for calling the data cleaning component 15 to clean the historical data.
Further, the quality check component 14 further comprises a quality rule engine for identifying and parsing the obtained quality rules.
Further, the data collection component 13 is specifically configured to mark the data type of the data as structured data, unstructured data, or streaming data and store the marked data according to the source of the data and the format information of the data after the collection of the single data is completed.
Further, the quality rule base 17 is specifically configured to obtain a technical rule or a service rule in a service system or a management platform, obtain change information of the rule in real time, and convert the technical rule or the service rule into the quality rule through analysis, where a check type of the quality rule includes at least one of: integrity, normalization, consistency, accuracy, uniqueness, relevance.
Further, the quality inspection component 14 is specifically configured to select different processing modes according to the annotated data type and the check type of the quality rule, where the processing modes include text processing, Spark SQL processing, and Spark Streaming processing.
Further, the text processing is used for quality checks of integrity, accuracy and uniqueness;
the Spark SQL processing is used for quality checks of consistency, accuracy and relevance;
the Spark Streaming processing is used for quality checks of service data whose timeliness requirement is greater than a first threshold, or whose throughput is greater than a second threshold.
Further, the collection of the business rules is acquired through a Kafka message service queue.
An embodiment of the present invention further provides a data quality inspection method, which is applied to the data quality inspection system described above, and as shown in fig. 3, the method includes:
Step 101: acquiring inspection objects in batch or in real time, and storing the inspection objects;
Step 102: acquiring a quality rule from a quality rule base, and performing quality inspection on the inspection object according to the quality rule, wherein the type of the quality rule is matched with the data type of the inspection object;
Step 103: recording the checking result and the abnormal data, and returning the query result according to the query instruction input by the user.
In the embodiment, an efficient, universal and automatic data quality inspection scheme is constructed, and a scheme for automatically matching quality inspection processing is provided according to the data type and the quality rule type of an inspection object, so that the universality and the expandability of the scheme are improved; according to the technical scheme, the timeliness of the data quality management system is greatly improved, TB-level large-scale, high-performance and high-timeliness data quality inspection in a large data platform environment can be borne, and the availability and timeliness of the data quality management system are improved.
Further, the data quality inspection method further includes:
and calling the data cleaning component to clean the historical data.
Further, the data quality inspection method further includes:
summarizing the inspection results in text mode by using the map or reduce operations of Spark, or importing the inspection result details into a database and summarizing them (SQL) or aggregating them (NoSQL).
Further, the data quality inspection method further includes:
when the data volume of the abnormal data is greater than or equal to a third threshold and all abnormal data need to be queried, storing the abnormal data on the data storage component; and when the data volume of the abnormal data is smaller than the third threshold or only sample abnormal data need to be queried, writing the abnormal data into a database server of the monitoring component.
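The storage decision for abnormal data can be sketched as a single predicate. This is a minimal illustration rather than the implementation of the embodiment: the function name `choose_abnormal_sink`, the two sink labels and the way the "all data versus sample" requirement is passed in are assumptions introduced here.

```python
def choose_abnormal_sink(abnormal_count: int, need_all: bool, third_threshold: int) -> str:
    """Decide where abnormal records are written.

    Returns "data_storage" for the data storage component (HDFS in the
    embodiment) or "monitor_db" for the database server of the monitoring
    component. Both labels are hypothetical names used for illustration.
    """
    # Large volume AND every abnormal record must remain queryable.
    if abnormal_count >= third_threshold and need_all:
        return "data_storage"
    # Volume below the threshold, OR only sample records are needed.
    return "monitor_db"
```

Because the second condition is the exact negation of the first, every combination of volume and query requirement is routed to exactly one destination.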
The steps of the data quality inspection method can be implemented by the data quality inspection system.
The data quality inspection scheme of the present invention is further described with reference to the following drawings and specific embodiments:
This embodiment makes full use of the Hadoop distributed system architecture: a message service supports real-time data flow, HDFS supports the storage of mass data, the Spark framework provides efficient parallel processing of mass data, and a DBConnector establishes the data link between the processing platform and the monitoring platform, thereby building a real-time, automated, highly available, high-performance and highly extensible system for real-time quality monitoring of mass data.
The processing manner of the data quality inspection system of the present embodiment is shown in FIG. 4. On the basis of the original components, the real-time data quality inspection system of this embodiment abandons the traditional relational database, uses HDFS to store mass data, and builds a universal quality rule engine on the Spark in-memory computing framework, thereby achieving efficient parallel processing of mass data. In addition, message queue services such as Kafka and Flume can be integrated so that the mapping of service data and the checking of quality rules are completed in a streaming manner, achieving genuinely real-time quality inspection of mass data.
The overall processing flow and the interaction process of each component of the data quality inspection system of the embodiment are shown in fig. 2, and the specific interaction steps are described as follows:
Step 1: the scheduling component 11 triggers and starts the master controller 12 according to time-based or condition-based scheduling;
Step 2: the master controller 12 first calls the data acquisition component 13, which acquires the inspection objects in batch or in real time through a data acquisition interface provided by the IT system;
Step 3: the inspection objects are stored in the HDFS as data preparation for the subsequent quality inspection processing;
Step 4: the master controller 12 then calls the quality inspection component 14;
Step 5: the quality inspection component 14 first obtains the quality rules from the quality rule base 17 and passes them to the quality rule engine for identification and parsing;
Step 6: the quality inspection component 14 acquires the inspection objects from the data storage component 16;
Step 7: the quality inspection component 14 applies the quality rules to the inspection objects one by one;
Step 8: the processing results of the quality rule engine (including quality inspection results and abnormal data) are recorded in the monitoring platform 18, which displays the quality inspection results to the user and supports querying of the abnormal data;
Step 9: after the quality inspection is finished, the master controller 12 cleans the historical data according to a data cleaning strategy;
Step 10: the user queries the quality inspection results, the detailed abnormal data and the quality reports through the monitoring platform 18.
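The call order of the interaction steps can be sketched as a small orchestration function. This is only an illustration of the control flow under stated assumptions: the components are represented by plain callables, the names `run_master_controller`, `collect`, `load_rules`, `record_result` and `clean_history` are hypothetical, and an in-memory list stands in for the HDFS data storage component.

```python
from typing import Callable, List

Record = dict
Rule = Callable[[Record], bool]  # returns True when a record passes the rule


def run_master_controller(
    collect: Callable[[], List[Record]],                  # steps 2-3: acquire and store
    load_rules: Callable[[], List[Rule]],                 # step 5: fetch and parse rules
    record_result: Callable[[int, List[Record]], None],   # step 8: report to monitoring
    clean_history: Callable[[], None],                    # step 9: clean historical data
) -> None:
    """Run one inspection cycle once the scheduler has triggered it (step 1)."""
    stored = collect()          # in-memory stand-in for data written to HDFS
    rules = load_rules()        # step 4-5: quality inspection component starts
    for index, rule in enumerate(rules):                  # step 7: rules one by one
        abnormal = [r for r in stored if not rule(r)]     # step 6: read stored objects
        record_result(index, abnormal)
    clean_history()
```

A caller would pass in concrete collectors and rule loaders; step 10 (user queries) happens on the monitoring side and is outside this sketch.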
The implementation of the key components is detailed as follows:
1. Data acquisition and data annotation
The service data can be divided into two types according to the data source: batch data, which is acquired by using a traditional file transfer protocol, and streaming data, which can be acquired by using message queues such as Kafka and Flume. After each piece of data has been acquired, the data acquisition component 13 classifies it as structured, unstructured or streaming data according to information such as its source and format, marks it accordingly, and stores it.
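The annotation step can be sketched as a small classifier over the source and format of each acquired item. The embodiment does not state the exact classification criteria, so the source names, the list of structured formats and the function name `annotate` below are assumptions chosen for illustration.

```python
from enum import Enum


class DataType(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"
    STREAMING = "streaming"


# Assumed lookup sets; a real deployment would configure these.
STREAM_SOURCES = {"kafka", "flume"}
STRUCTURED_FORMATS = {"csv", "parquet", "orc"}


def annotate(source: str, fmt: str) -> DataType:
    """Label one acquired item from its source and format information."""
    if source.lower() in STREAM_SOURCES:
        return DataType.STREAMING
    if fmt.lower() in STRUCTURED_FORMATS:
        return DataType.STRUCTURED
    return DataType.UNSTRUCTURED
```

The label returned here is what the quality inspection component later uses, together with the check type of the rule, to pick a processing mode.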
The service data uses the HDFS to meet the storage requirement of mass data. According to the architecture of the enterprise-level IT system, the method is divided into the following two modes:
a. Using an enterprise-level Hadoop file system
If the enterprise has already built an enterprise-level Hadoop file system (HDFS), for example as part of an enterprise-level big data platform, and the business data is stored on that platform and updated regularly, the system does not need to collect the business data again; the original business data is read directly from the enterprise-level HDFS for use by the data quality inspection platform.
b. Using an application-level Hadoop file system
If an enterprise-level HDFS is not used, or if only the data of some service domains is checked, or the service data is only sampled for checking according to service requirements, an application-level HDFS can be built, and the service data is copied to it for storage by means such as hdfs dfs -put or hadoop distcp.
2. Quality rule parsing
The quality rules are derived from technical or business rules in a business system or management platform. The quality rule base 17 obtains the change information of the rules in real time through the pushed message service, and converts the technical rules or the business rules into the quality rules through analysis, wherein the quality rules can be classified into integrity, normalization, consistency, accuracy, uniqueness, relevance and the like according to the types of the inspection. The analyzed quality rules are the calculation basis for the subsequent universal quality rule engine to process the service data, thereby realizing the real-time performance of the quality inspection.
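Rule parsing can be sketched as converting a pushed rule-change message into a typed quality rule. The embodiment does not define the message format, so the JSON fields (`rule_id`, `check_type`, `target`, `expression`), the `QualityRule` class and the function `parse_rule_message` are all hypothetical; only the six check types come from the text.

```python
import json
from dataclasses import dataclass

# The six check types named in the description.
CHECK_TYPES = {
    "integrity", "normalization", "consistency",
    "accuracy", "uniqueness", "relevance",
}


@dataclass(frozen=True)
class QualityRule:
    rule_id: str
    check_type: str
    target: str       # field or table the rule applies to
    expression: str   # rule body, interpreted later by the rule engine


def parse_rule_message(message: str) -> QualityRule:
    """Turn one pushed rule-change message (assumed JSON) into a quality rule."""
    payload = json.loads(message)
    check_type = payload["check_type"].lower()
    if check_type not in CHECK_TYPES:
        raise ValueError(f"unknown check type: {check_type}")
    return QualityRule(
        payload["rule_id"], check_type, payload["target"], payload["expression"]
    )
```

Rejecting unknown check types at parse time keeps the later mode selection simple, since the engine then only ever sees one of the six known types.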
3. Quality rules engine
The input of the quality rules engine is divided into two parts, one part is the business data output by the data acquisition component 13, and the other part is the quality rules output by the quality rules library 17. The processing logic of the quality rule engine is to apply the quality rules to the service data, complete the quality check of the service data, and output the quality check result. Since the service data are labeled according to structured, unstructured and streaming data in the data acquisition component 13, different processing modes are dynamically selected according to the type of the service data label and the type of the quality rule, and the data quality inspection is automatically completed. The method specifically comprises the following steps:
Processing mode by data type and check type:
Data type | Integrity | Normalization | Consistency | Accuracy | Uniqueness | Relevance
Structured data | Text processing | Text processing | Spark SQL | Spark SQL | Text processing | Spark SQL
Unstructured data | Text processing | Text processing | Text processing | Text processing | Text processing | Text processing
Streaming data | Spark Streaming | Spark Streaming | Spark SQL | Spark SQL | Spark Streaming | Spark SQL
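The mapping above amounts to a lookup keyed on the annotated data type and the check type of the rule. The sketch below encodes it as a function; the name `select_mode` and the string labels are assumptions, and the streaming/relevance entry is read as Spark SQL because that cell is garbled in the source text.

```python
TEXT = "text"
SPARK_SQL = "spark_sql"
SPARK_STREAMING = "spark_streaming"

CHECK_TYPES = {
    "integrity", "normalization", "consistency",
    "accuracy", "uniqueness", "relevance",
}
# Checks that compare several pieces of data with each other.
ASSOCIATION_CHECKS = {"consistency", "accuracy", "relevance"}


def select_mode(data_type: str, check_type: str) -> str:
    """Pick a processing mode from the data annotation and the rule's check type."""
    if check_type not in CHECK_TYPES:
        raise ValueError(f"unknown check type: {check_type}")
    if data_type == "unstructured":
        return TEXT                      # every check on unstructured data is text-based
    if data_type not in ("structured", "streaming"):
        raise ValueError(f"unknown data type: {data_type}")
    if check_type in ASSOCIATION_CHECKS:
        return SPARK_SQL                 # association-style checks go through SQL
    return TEXT if data_type == "structured" else SPARK_STREAMING
```

Expressing the table as code makes it easy to add a data type or check type later without touching the checks themselves.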
The technical implementation of the quality rule engine is described in detail below.
The universal quality rule engine is built on the Spark in-memory computing framework. Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark has the advantages of Hadoop MapReduce but, unlike MapReduce, can keep the intermediate output of a job in memory, so that HDFS does not have to be read and written between stages; this greatly improves the efficiency of data inspection processing. In addition, through integration with resource scheduling tools such as YARN, Spark can run in parallel on a Hadoop cluster with native support for HDFS; its computing nodes can be scaled flexibly, and it supports large-scale data processing by running a large number of inexpensive computing resources concurrently.
The technical architecture of the quality rules engine is shown in fig. 5.
The Spark workflow can be summarized in three steps: creating concurrent tasks; applying transformations to the data, such as map, filter, union and intersection; and then performing actions such as reduce or count, or simply collecting the results.
The service data is processed in one of the following three modes: text processing for simple checks, Spark SQL processing for rule-based checks, and Spark Streaming processing for streaming data checks. The modes are described as follows:
(1) Text processing
Text processing is used for quality checks of integrity, accuracy and uniqueness, for example whether the type and length of a field are legal, whether a field is illegally null, or whether a field value falls outside its legal range. Such scenarios are characterized by the fact that the check can be completed by parsing the data alone, without associating it with other data. In this mode the service data is therefore mapped to an RDD, and the quality check is performed with Spark operations such as map and filter. Because specific map, filter and reduce procedures are written for each quality rule, this mode is also suitable for unstructured data; and because native operations such as map are used and data association is generally not required, shuffles are avoided and execution efficiency is high.
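The record-by-record nature of text processing can be illustrated without a cluster. In the sketch below, Python's built-in `map` and `filter` stand in for the Spark RDD operations of the same names; the sample records, the field names and the legal age range are invented for illustration.

```python
from collections import Counter

# Invented sample lines: user_id, msisdn, age
records = [
    "1001,13800000001,25",
    "1002,,31",
    "1003,13800000003,212",
    "1001,13800000009,40",
]


def parse(line: str) -> dict:
    user_id, msisdn, age = line.split(",")
    return {"user_id": user_id, "msisdn": msisdn, "age": int(age)}


rows = list(map(parse, records))

# Integrity: msisdn must not be empty.
missing_msisdn = list(filter(lambda r: r["msisdn"] == "", rows))

# Accuracy: age must fall inside an assumed legal range.
bad_age = list(filter(lambda r: not 0 <= r["age"] <= 150, rows))

# Uniqueness: user_id must not repeat.
id_counts = Counter(map(lambda r: r["user_id"], rows))
duplicate_ids = sorted(k for k, n in id_counts.items() if n > 1)
```

The integrity and accuracy checks look at one record at a time, which is why they need no data association; the uniqueness check is the only one here that has to count across records.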
(2) Spark SQL processing
The Spark SQL mode is suitable for quality checks on consistency, accuracy, and relevance, such as foreign key association violations and association violations among multiple data values. The characteristic of this application scenario is that logical calculation, comparison, and similar operations must be carried out across multiple pieces of data. Therefore, in this processing mode, the business data is mapped into a Dataset, Schema information is added, the Dataset is converted into a temporary or global data table, and the quality rules are executed as SQL queries.
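A foreign key check expressed as a query can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for Spark SQL, and the table names, column names, and the function find_orphan_orders are assumptions made only for illustration.

```python
import sqlite3

def find_orphan_orders(customers, orders):
    """Return orders whose customer_id has no match in the customers table."""
    conn = sqlite3.connect(":memory:")
    # Register the two data sets as tables, much as a temporary table would be.
    conn.execute("CREATE TABLE customers (customer_id TEXT)")
    conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?)", [(c,) for c in customers])
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    # The quality rule itself is an ordinary SQL query (hypothetical rule text).
    rule_sql = """
        SELECT o.order_id, o.customer_id
        FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
        ORDER BY o.order_id
    """
    rows = conn.execute(rule_sql).fetchall()
    conn.close()
    return rows
```

Because the rule is plain SQL text, a rule written in this style could in principle be stored in a rule base and executed without writing new code for each rule.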
(3) Spark Streaming processing
This mode is suitable for quality rules with extremely high timeliness requirements or for application scenarios with extremely high business data throughput. In this processing mode, data is acquired as a stream through message services such as Kafka and Flume, and the business data is mapped into a DStream for processing. In addition, due to the characteristics of streaming data, this mode is not suitable for quality checks that require data association, such as consistency checks.
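The streaming mode can be pictured as a sequence of small batches that are each checked as they arrive. The following is a plain-Python sketch and not the Spark Streaming API; the "key=value" message format, the batch size, and the function names are hypothetical.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Cut an incoming iterable of messages into consecutive small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def is_valid(message):
    # Hypothetical per-record rule: "key=value" with a non-empty key and value.
    key, sep, value = message.partition("=")
    return sep == "=" and key != "" and value != ""

def check_stream(stream, batch_size=2):
    # Each batch is checked on its own; no batch looks at any other batch.
    for batch_no, batch in enumerate(micro_batches(stream, batch_size)):
        abnormal = [m for m in batch if not is_valid(m)]
        yield {"batch": batch_no, "checked": len(batch), "abnormal": abnormal}
```

Since every batch is inspected independently, a rule that must compare a record with data outside its own batch cannot be expressed this way, which is consistent with the limitation on association checks noted for this mode.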
The technical comparison of the several treatment modes is as follows:
(Comparison table provided as figure BDA0001933547040000121 in the original publication; its contents are not reproduced here.)
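The choice among the three processing modes can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the string labels, the function name select_processing_mode, the decision to route accuracy checks to text processing, and the decision to reject unsupported combinations with an error are all hypothetical and are not specified by the embodiment.

```python
# Check types that each mode is described as handling (labels are assumptions).
TEXT_CHECKS = {"integrity", "accuracy", "uniqueness"}
ASSOCIATION_CHECKS = {"consistency", "relevance"}  # need data association

def select_processing_mode(data_type, check_type):
    """Pick a processing mode from the data label and the rule's check type."""
    if data_type == "streaming":
        if check_type in ASSOCIATION_CHECKS:
            # Streaming is described as unsuitable for association checks.
            raise ValueError("association checks are not suited to streaming data")
        return "spark_streaming"
    if data_type == "structured" and check_type in ASSOCIATION_CHECKS:
        return "spark_sql"
    if check_type in TEXT_CHECKS:
        # Assumption: accuracy checks default to text processing.
        return "text"
    raise ValueError("no processing mode defined in this sketch")
```

For example, a consistency rule over structured data is routed to the Spark SQL mode, while an integrity rule over unstructured data is routed to text processing.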
4. Quality monitoring
The quality monitoring component of this embodiment mainly comprises three parts: quality rule parsing, summarization of quality inspection result details, and abnormal business data query. The details are as follows:
(1) Quality rule parsing
To meet the requirement of real-time quality inspection, business rules are collected through a message service queue such as Kafka and are parsed into quality rules in real time on the quality operation platform, where they can be called by the quality rule engine.
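The parsing step can be sketched as follows, assuming purely for illustration that each message carries a business rule as a JSON document. The field names (id, target, check_type, expression), the QualityRule class, and the function parse_rule_message are hypothetical; the embodiment does not specify a message format.

```python
import json
from dataclasses import dataclass

# Inspection types that a quality rule may carry.
CHECK_TYPES = {"integrity", "normalization", "consistency",
               "accuracy", "uniqueness", "relevance"}

@dataclass(frozen=True)
class QualityRule:
    rule_id: str
    target: str        # the inspection object the rule applies to
    check_type: str    # one of CHECK_TYPES
    expression: str    # rule body, e.g. a SQL fragment or a pattern

def parse_rule_message(payload):
    """Turn one business-rule message (JSON text) into a QualityRule."""
    doc = json.loads(payload)
    check_type = doc["check_type"].lower()
    if check_type not in CHECK_TYPES:
        raise ValueError("unknown check type: " + check_type)
    return QualityRule(rule_id=str(doc["id"]), target=doc["target"],
                       check_type=check_type, expression=doc["expression"])
```

Validating the check type at parse time means that a malformed rule is rejected before it reaches the rule engine.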
(2) Summary of quality details
In the real-time streaming data processing mode, the original business data is fragmented or partitioned, and the processing results of the quality rule engine are fine-grained. In this case, the inspection result details need to be summarized to some extent along dimensions such as time and inspection object. The summary can be computed in text mode using Spark's map and reduce, or by importing the inspection result details into a database and summarizing (SQL) or aggregating (NoSQL) them there.
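A summary along the time and inspection-object dimensions can be sketched as follows. The sketch is plain Python; the detail record fields (ts, object, passed), the one-hour bucket, and the function name summarize are assumptions made only for illustration.

```python
def summarize(details, bucket_seconds=3600):
    """Roll fine-grained results up by (time bucket, inspection object)."""
    summary = {}
    for detail in details:
        # Align the timestamp to the start of its time bucket.
        bucket = detail["ts"] - detail["ts"] % bucket_seconds
        key = (bucket, detail["object"])
        counts = summary.setdefault(key, {"checked": 0, "failed": 0})
        counts["checked"] += 1
        if not detail["passed"]:
            counts["failed"] += 1
    return summary
```

The same grouping could equally be expressed as a GROUP BY query once the details have been imported into a database.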
(3) Abnormal data query
Abnormal data refers to the set of illegal data that is produced by quality verification and does not conform to the quality rules. Abnormal data is stored in one of two ways, depending on the application scenario. For a scenario with a large data volume in which all abnormal data must be queryable, the quality operation platform stores the abnormal data directly on HDFS, and the data query service is provided to the quality monitoring platform through the data access interface provided by HDFS. For a scenario with a small data volume, or in which only sample abnormal data needs to be queried, the abnormal data can be written directly into the database server of the quality monitoring platform through a DB Connector as it is generated, thereby providing the abnormal data query service.
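The storage decision can be sketched as a single routing function. The threshold value, the return labels, and the function name choose_abnormal_store are hypothetical; only the decision logic follows the two scenarios described above.

```python
def choose_abnormal_store(record_count, need_full_query, threshold=1_000_000):
    """Decide where abnormal data goes; the threshold value is an assumption."""
    # Large volume AND every abnormal record must be queryable: keep it on HDFS.
    if record_count >= threshold and need_full_query:
        return "hdfs"
    # Small volume, or only a sample is needed: write to the monitoring database.
    return "database"
```

Under this sketch, a large result set for which only samples are required is still written to the database, since the full set never needs to be served.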
The quality operation platform provided by this embodiment remains seamlessly compatible with an enterprise's existing SQL-based quality inspection rules while being extensible according to actual business scenarios and application requirements, and it supports not only structured data but also unstructured data and streaming data.
An embodiment of the present invention further provides a data quality inspection apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the data quality checking method as described above.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the data quality inspection method described above.
For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The embodiments in the present specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, user equipment (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or user equipment that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or user equipment. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or user equipment that comprises the element.
While the preferred embodiments of the present invention have been described, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims (14)

1. A data quality inspection system is characterized by comprising a monitoring component, a data acquisition component, a quality inspection component, a quality rule base and a data storage component, wherein the data storage component adopts a distributed file system (HDFS);
the monitoring component is used for calling the data acquisition component and the quality inspection component;
the data acquisition component is used for acquiring the inspection objects in batch or in real time and storing the inspection objects into the data storage component;
the quality inspection component is used for acquiring quality rules from the quality rule base, acquiring inspection objects from the data storage component, and performing quality inspection on the inspection objects according to the quality rules, wherein the types of the quality rules are matched with the data types of the inspection objects;
the monitoring component is also used for recording the inspection result and the abnormal data of the quality inspection component and returning the inquiry result according to the inquiry command input by the user.
2. The data quality inspection system of claim 1, further comprising a data cleansing component,
the monitoring component is also used for calling the data cleaning component to clean the historical data.
3. The data quality inspection system of claim 1, wherein the quality inspection component further comprises a quality rules engine to identify and parse the obtained quality rules.
4. The data quality inspection system of claim 1,
the data acquisition component is specifically used for marking the data type of the data as structured data, unstructured data or streaming data and then storing the data according to the data source and the format information of the data after the single data is acquired.
5. The data quality inspection system of claim 4,
the quality rule base is specifically used for acquiring technical rules or business rules in a business system or a management platform, acquiring change information of the rules in real time, and converting the technical rules or the business rules into the quality rules through analysis, wherein the inspection types of the quality rules include at least one of the following types: integrity, normalization, consistency, accuracy, uniqueness, relevance.
6. The data quality inspection system of claim 5, wherein the quality inspection component is specifically configured to select different processing modes according to the type of the data label and the inspection type of the quality rule, and the processing modes include text processing, Spark SQL processing, and Spark Streaming processing.
7. The data quality inspection system of claim 6, wherein the text processing is used for quality inspection of integrity, accuracy, uniqueness;
the Spark SQL processing is used for quality inspection of consistency, accuracy, relevance;
the Spark Streaming processing is used for quality inspection of business data whose timeliness requirement is greater than a first threshold or whose throughput is greater than a second threshold.
8. The data quality inspection system of claim 5,
the business rules are collected through a Kafka message service queue.
9. A data quality inspection method, comprising:
acquiring inspection objects in batch or in real time, and storing the inspection objects;
acquiring a quality rule from a quality rule base, and performing quality inspection on the inspection object according to the quality rule, wherein the type of the quality rule is matched with the data type of the inspection object;
and recording the checking result and the abnormal data, and returning the query result according to the query instruction input by the user.
10. The data quality inspection method according to claim 9, further comprising:
and cleaning the historical data.
11. The data quality inspection method according to claim 9, further comprising:
summarizing the inspection results in text mode by using map or reduce of Spark, or importing the details of the inspection results into a database for SQL summarization or NoSQL aggregation.
12. The data quality inspection method according to claim 9, further comprising:
when the data volume of the abnormal data is larger than or equal to a third threshold and all abnormal data need to be inquired, storing the abnormal data on a data storage component; and when the data volume of the abnormal data is smaller than the third threshold or only the abnormal data of the sample needs to be inquired, writing the abnormal data into a database server of the monitoring component.
13. A data quality inspection apparatus characterized by comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps in the data quality checking method according to any one of claims 9 to 12.
14. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps in the data quality inspection method according to any one of claims 9 to 12.
CN201910000892.8A 2019-01-02 2019-01-02 Data quality inspection method and system Pending CN111400288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910000892.8A CN111400288A (en) 2019-01-02 2019-01-02 Data quality inspection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910000892.8A CN111400288A (en) 2019-01-02 2019-01-02 Data quality inspection method and system

Publications (1)

Publication Number Publication Date
CN111400288A true CN111400288A (en) 2020-07-10

Family

ID=71432017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910000892.8A Pending CN111400288A (en) 2019-01-02 2019-01-02 Data quality inspection method and system

Country Status (1)

Country Link
CN (1) CN111400288A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014163624A1 (en) * 2013-04-02 2014-10-09 Hewlett-Packard Development Company, L.P. Query integration across databases and file systems
CN107545349A (en) * 2016-06-28 2018-01-05 国网天津市电力公司 A kind of Data Quality Analysis evaluation model towards electric power big data
US20180173733A1 (en) * 2016-12-19 2018-06-21 Capital One Services, Llc Systems and methods for providing data quality management
CN106874483A (en) * 2017-02-20 2017-06-20 山东鲁能软件技术有限公司 A kind of device and method of the patterned quality of data evaluation and test based on big data technology
US20180341956A1 (en) * 2017-05-26 2018-11-29 Digital River, Inc. Real-Time Web Analytics System and Method
CN107958049A (en) * 2017-11-28 2018-04-24 航天科工智慧产业发展有限公司 A kind of quality of data checking and administration system
CN108846076A (en) * 2018-06-08 2018-11-20 山大地纬软件股份有限公司 The massive multi-source ETL process method and system of supporting interface adaptation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龙婧;刘伟;殷胜;: "基于机器学习的电网设备档案数据异常诊断研究" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111641532A (en) * 2020-03-30 2020-09-08 北京红山信息科技研究院有限公司 Communication quality detection method, device, server and storage medium
CN111641532B (en) * 2020-03-30 2022-02-18 北京红山信息科技研究院有限公司 Communication quality detection method, device, server and storage medium
CN112162980A (en) * 2020-11-26 2021-01-01 成都数联铭品科技有限公司 Data quality control method and system, storage medium and electronic equipment
CN112579352A (en) * 2020-12-14 2021-03-30 广州信安数据有限公司 Quality monitoring result generation method, storage medium and quality monitoring system of service data processing link
CN113010566A (en) * 2021-03-31 2021-06-22 建信金融科技有限责任公司 Batch processing result checking method and device
CN113157745A (en) * 2021-04-28 2021-07-23 上海交大慧谷通用技术有限公司 Data quality detection method and system
CN113242157A (en) * 2021-05-08 2021-08-10 国家计算机网络与信息安全管理中心 Centralized data quality monitoring method under distributed processing environment
CN113242157B (en) * 2021-05-08 2022-12-09 国家计算机网络与信息安全管理中心 Centralized data quality monitoring method under distributed processing environment
CN115994194A (en) * 2023-03-23 2023-04-21 河北东软软件有限公司 Method, system, equipment and medium for checking data quality of government affair big data
CN115994194B (en) * 2023-03-23 2023-06-02 河北东软软件有限公司 Method, system, equipment and medium for checking data quality of government affair big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination