CN113157676A - Data quality management method, system, device and storage medium

Info

Publication number
CN113157676A
Authority
CN
China
Prior art keywords
quality inspection
data
quality
module
rule
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110401537.9A
Other languages
Chinese (zh)
Inventor
张迎峰
吴仲维
黎永昇
钟炳汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Guangdong Industrial Internet Co Ltd
Original Assignee
China Unicom Guangdong Industrial Internet Co Ltd
Application filed by China Unicom Guangdong Industrial Internet Co Ltd
Priority to CN202110401537.9A
Publication of CN113157676A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Factory Administration (AREA)

Abstract

The invention discloses a data quality management method, a system, a device and a storage medium, wherein the data quality management system comprises a service platform, an ETL platform, a scheduling tool and a data center, a user can initiate quality inspection service through the service platform, configure quality inspection rules by using the ETL platform, perform quality inspection on data provided by the data center, and schedule quality inspection tasks by using the scheduling tool. The invention also provides a data quality management method, which determines synchronous data according to the data source of the data center, determines quality inspection service according to the service platform and the ETL platform, performs field-level quality inspection on the synchronous data and determines a quality inspection report. According to the embodiment of the application, the quality inspection report is generated through field-level quality inspection of the synchronous data, the data field with problems is accurately positioned, the reason that the quality inspection cannot pass is conveniently analyzed according to the data field with problems, and the quality inspection quality is improved.

Description

Data quality management method, system, device and storage medium
Technical Field
The present application relates to the field of data quality management, and in particular, to a method, a system, an apparatus, and a storage medium for data quality management.
Background
With the rapid development of modern information technology, the big data era has arrived: data of all kinds grows explosively every day, and data resources have been greatly enriched. Taking government affairs as an example, government departments rely on massive data for work such as regional population flow analysis, fertility rate statistics and employment rate statistics. To make full use of these data resources, the departments need to ensure that the data is of high quality, so that the data resources can deliver real value. However, most government data is entered manually, so its quality is poor; and because the data volume is large, later manual checking is also quite difficult.
In the related art, some data quality monitoring platforms exist, but they have drawbacks such as large system size, many dependent components and troublesome deployment. Most of these platforms are private to an enterprise, are designed around that enterprise's business, and are not fully applicable to data of other industries.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, the application provides a data quality management method, a system, a device and a storage medium.
In a first aspect, an embodiment of the present application provides a data quality management system, including a service platform, an ETL platform, a scheduling tool, and a data center; the service platform is used for providing quality inspection service; the ETL platform is used for configuring and managing quality inspection rules; the scheduling tool is used for scheduling quality inspection tasks; the data center is used for managing data sources and executing quality inspection tasks.
Optionally, the ETL platform includes a rule definition module, a variable management module, a detection module, and an instance management module; the rule definition module is used for predefining a quality inspection rule; the variable management module is used for managing time variables, and the time variables are used for determining quality inspection periods; the detection module is used for configuring the quality inspection rule, detecting a specific field of data according to the quality inspection rule and generating an instance; the instance management module is used for managing the instances, and the instances at least comprise quality inspection logs and quality inspection reports.
Optionally, the service platform includes a quality inspection module, a rule management module, a file management module, and a report management module; the quality inspection module is used for initiating the quality inspection task; the rule management module is used for configuring field-level quality inspection rules; the file management module is used for managing a basis file; the report management module is used for managing the data quality inspection report.
Optionally, the data center comprises a data warehouse tool, a file storage module, and a data synchronization tool; the data warehouse tool is used for extracting, converting and loading data; the file storage module is used for storing data; the data synchronization tool is used for synchronizing data.
Optionally, the system further comprises an automatic quality inspection module and a third-party quality inspection module; the automatic quality inspection module is used for automatically inspecting the catalog hanging of the data; and the third-party quality inspection module is used for initiating quality inspection service by a third-party application program.
In a second aspect, an embodiment of the present application provides a data quality management method, to which the data quality management system of the first aspect is applied, where the method includes: determining synchronous data according to a data source of the data center; determining quality inspection service according to the service platform and the ETL platform; according to the quality inspection service, performing quality inspection on the synchronous data and determining a quality inspection report; the quality inspection service at least comprises a plurality of quality inspection fields and quality inspection rules corresponding to the quality inspection fields.
Optionally, the quality inspection report at least includes an overall quality inspection result, an overall quality inspection yield, a yield of each field, and a quality inspection problem list.
Optionally, the quality inspection rule at least includes null value verification, format verification, maximum/minimum value verification, value range verification, and record number verification.
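By way of illustration only, the five rule types named above can be sketched as small Python predicates. The function names, signatures and the exact pass criteria are assumptions made for this sketch; the application does not define how each verification is computed.

```python
import re
from typing import Iterable, Optional, Set


def null_check(value: Optional[str]) -> bool:
    """Null value verification: pass when the value is present and not blank."""
    return value is not None and value.strip() != ""


def format_check(value: str, pattern: str) -> bool:
    """Format verification: pass when the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


def extreme_value_check(values: Iterable[float], lower: float, upper: float) -> bool:
    """Maximum/minimum verification: pass when the column's min and max stay in bounds."""
    items = list(values)
    return bool(items) and min(items) >= lower and max(items) <= upper


def value_range_check(value: str, allowed: Set[str]) -> bool:
    """Value range verification: pass when the value is one of the allowed values."""
    return value in allowed


def record_count_check(count: int, minimum: int, maximum: int) -> bool:
    """Record number verification: pass when the row count lies in the expected range."""
    return minimum <= count <= maximum
```

Each predicate returns True when the check passes, so a caller can apply the same predicate to every row of a field and count the failures.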
In a third aspect, an embodiment of the present application provides an apparatus, including: at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, causing the at least one processor to implement the data quality management method according to the second aspect.
In a fourth aspect, embodiments of the present application provide a computer storage medium in which a processor-executable program is stored, the processor-executable program being configured to implement the data quality management method according to the second aspect when executed by the processor.
The beneficial effects of the embodiment of the application are as follows: the application provides a data quality management system comprising a service platform, an ETL platform, a scheduling tool and a data center. A user can initiate a quality inspection service through the service platform, configure quality inspection rules by using the ETL platform, and perform quality inspection on data provided by the data center, while the system uses the scheduling tool to schedule the quality inspection tasks and complete the quality inspection of the data. The embodiment of the application allows quality inspection rules to be configured autonomously according to business needs, which improves the flexibility and the effectiveness of data quality inspection. In addition, the embodiment of the application also provides a data quality management method, which is applied to the data quality management system. According to the embodiment of the application, the quality inspection report is generated through field-level quality inspection of the synchronous data, the data field with problems can be accurately positioned, the reason that the quality inspection cannot pass is conveniently analyzed according to the data field with problems, and the quality of the data quality inspection is effectively improved.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
Fig. 1 is a first schematic diagram of a data quality management system architecture provided by an embodiment of the present application;
Fig. 2 is a flow chart of steps of a data quality management method provided by an embodiment of the present application;
Fig. 3 is a second schematic diagram of a data quality management system architecture provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of a device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional block divisions are provided in the system drawings and logical orders are shown in the flowcharts, in some cases, the steps shown and described may be performed in different orders than the block divisions in the systems or in the flowcharts. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In the current big data era, the data held by government departments and similar units is of poor quality because most government data is entered manually; and because the data volume is large, later manual checking is also quite difficult, so quality management of the data is very important. In the related art, some data quality monitoring platforms exist. At present, mainstream quality inspection platforms such as DQC (Data Quality Center, Alibaba's data quality monitoring platform) and BDP (JD.com's big data quality monitoring platform) are not open to the outside; these platforms were developed for the business requirements of their own enterprises and are not fully suitable for data of other industries. There are also some open-source quality monitoring platforms in the related art, for example Apache Griffin, an open-source big data quality solution. Although it can implement data quality monitoring, it is troublesome to deploy, the system is large, it depends on too many components, the user experience is not ideal, and it does not support data sources such as Oracle, MySQL and Hive.
Based on the defects in the related technology, the embodiment of the application provides a data quality management system, which comprises a service platform, an ETL platform, a scheduling tool and a data center, wherein a user can initiate a quality inspection service through the service platform, configure a quality inspection rule by using the ETL platform, and perform quality inspection on data provided by the data center, and schedule a quality inspection task by using the scheduling tool to complete quality inspection of the data. In addition, the embodiment of the application also provides a data quality management method, which is applied to the data quality management system and comprises the steps of firstly determining synchronous data according to a data source, then determining a quality inspection field and a quality inspection rule which need quality inspection according to quality inspection service, carrying out field-level quality inspection on the synchronous data and determining a quality inspection report. According to the embodiment of the application, the quality inspection report is generated through field-level quality inspection of the synchronous data, the data field with problems can be accurately positioned, the reason that the quality inspection cannot pass is conveniently analyzed according to the data field with problems, and the quality of the data quality inspection is effectively improved.
The embodiments of the present application will be further explained with reference to the drawings.
Referring to fig. 1, fig. 1 is a first schematic diagram of a data quality management system architecture provided in an embodiment of the present application. The system 100 includes a service platform 110, an ETL platform 120, a scheduling tool 130 and a data center 140. The service platform 110 provides the quality inspection service through operations that are easy for business personnel to understand. Through the service platform, the embodiment of the application makes data quality inspection configurable at the business level, encapsulates the quality inspection service, and implements a top-down capability call chain from the service platform to the data center. ETL (Extract-Transform-Load) describes the process of extracting, transforming and loading data from a source end to a destination end; in this embodiment, the ETL platform 120 is used to configure quality inspection rules and to perform quality inspection on the data according to those rules. The scheduling tool 130 is used to schedule quality inspection tasks, so that the system executes a quality inspection task immediately or on a regular schedule. In this embodiment of the present application, the scheduling tool may be DolphinScheduler, a distributed, decentralized and easily extensible visual DAG workflow task scheduling system, which serves as the component that runs the underlying logical tasks of data quality inspection. The data center 140 is used to manage data sources; in this embodiment it may be a data center based on a CDH (Cloudera's Distribution Including Apache Hadoop, a Hadoop distribution released by Cloudera) big data cluster, referred to as the CDH data center.
Through the data quality management system shown in fig. 1, the embodiment of the application provides a service platform that is convenient for a user to operate. The user initiates a quality inspection service through the service platform; the system encapsulates the quality inspection service and calls the ETL platform, which configures the quality inspection rules and performs quality inspection on the data of the data center; and the scheduling tool schedules the quality inspection tasks during this process. The embodiment of the application thus provides a self-contained data quality inspection system, through which the whole quality inspection process of the data is completed.
Because the business of each industry is different, the corresponding quality inspection rules are also different, so the accuracy of data inspected by the quality inspection platforms of the related art cannot be guaranteed. Moreover, the purpose of quality inspection is to improve the quality of the existing data; if an inspection only reports that the data passed or failed, the problem data cannot be located and the inspection is of little value.
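By way of illustration only, the call chain described above (service platform, then ETL platform, then scheduling tool, then data center) can be sketched with in-memory stand-ins. All class and method names are hypothetical, and the stand-in scheduler simply runs the task at once; this is not the DolphinScheduler or CDH API.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Rule = Callable[[str], bool]


class DataCenter:
    """Stand-in data center: holds data sources and executes checks on them."""

    def __init__(self, sources: Dict[str, List[Row]]) -> None:
        self.sources = sources

    def run_check(self, source: str, column: str, rule: Rule) -> List[bool]:
        return [rule(row.get(column, "")) for row in self.sources[source]]


class EtlPlatform:
    """Stand-in ETL platform: stores named quality inspection rules."""

    def __init__(self) -> None:
        self.rules: Dict[str, Rule] = {}

    def configure(self, name: str, rule: Rule) -> None:
        self.rules[name] = rule


class Scheduler:
    """Stand-in scheduling tool: runs the task immediately instead of on a timetable."""

    def submit(self, task: Callable[[], List[bool]]) -> List[bool]:
        return task()


class ServicePlatform:
    """Stand-in service platform: the entry point a business user would call."""

    def __init__(self, etl: EtlPlatform, scheduler: Scheduler, data_center: DataCenter) -> None:
        self.etl = etl
        self.scheduler = scheduler
        self.data_center = data_center

    def start_inspection(self, source: str, column: str, rule_name: str) -> List[bool]:
        rule = self.etl.rules[rule_name]
        return self.scheduler.submit(
            lambda: self.data_center.run_check(source, column, rule)
        )
```

In this sketch the user only touches `ServicePlatform.start_inspection`; the lower layers are reached through the call chain rather than directly.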
Based on the above-mentioned deficiencies of the related art, the embodiments of the present application provide a data quality management method. Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a data quality management method provided in an embodiment of the present application, where the method includes, but is not limited to, steps S200 to S220:
S200, determining synchronous data according to a data source of a data center;
Specifically, in the data quality management system provided in the embodiment of the present application, the bottom layer of the system implements data synchronization and determines the synchronous data based on the capabilities of the CDH data center and the scheduling tool DolphinScheduler. In addition, by means of the data center and the scheduling tool, the embodiment of the application can also implement the storage, scheduling and computation of data.
S210, determining quality inspection service according to the service platform and the ETL platform;
Specifically, in the data quality management system provided in the embodiment of the present application, the service platform is set on the upper layer of the system, and a user initiates a quality inspection service through it; the quality inspection service at least includes a plurality of quality inspection fields and the quality inspection rules corresponding to those fields. Because the service platform sits on the upper layer, a user can conveniently initiate a quality inspection task according to business requirements. The ETL platform is set on the bottom layer of the system and is used for configuring and managing the underlying quality inspection rules. The service platform obtains the existing underlying quality inspection rules from the ETL platform, configures quality inspection rules based on the basis files and at field level, and binds them to the underlying quality inspection rules of the ETL platform to complete the configuration of the quality inspection task. With field-level rule configuration, only the designated fields of the data are inspected, which reduces the data volume of each quality inspection, speeds up the process and improves quality inspection efficiency. In addition, inspecting designated fields helps to locate the problematic data fields accurately, makes the quality inspection result easier to analyze, and improves the quality of the inspection.
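By way of illustration only, the binding of field-level rules to underlying ETL rules described above can be sketched as a small data structure. The names `FieldRule`, `build_quality_service` and the rule identifiers are hypothetical; the application does not specify how the binding is stored.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed identifiers for the five underlying rule types named in the application.
UNDERLYING_RULES = {
    "null_check",
    "format_check",
    "extreme_value_check",
    "value_range_check",
    "record_count_check",
}


@dataclass(frozen=True)
class FieldRule:
    """One field-level rule configured on the service platform."""

    field_name: str
    rule_number: str
    underlying_rule: str  # the ETL-platform rule this rule is bound to


def build_quality_service(table: str, field_rules: List[FieldRule]) -> Dict[str, object]:
    """Group the bound underlying rules by field, rejecting unknown bindings."""
    fields: Dict[str, List[str]] = {}
    for rule in field_rules:
        if rule.underlying_rule not in UNDERLYING_RULES:
            raise ValueError(f"unknown underlying rule: {rule.underlying_rule}")
        fields.setdefault(rule.field_name, []).append(rule.underlying_rule)
    return {"table": table, "fields": fields}
```

Only the fields listed in the resulting service are inspected, which is the reduction in inspected data volume that the paragraph above describes.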
S220, performing quality inspection on the synchronous data according to quality inspection service, and determining a quality inspection report;
Specifically, the synchronous data is subjected to quality inspection according to the quality inspection fields and quality inspection rules in the quality inspection service. It should be noted that a plurality of fields may be inspected in parallel, each using the same or a different quality inspection rule. During the computation, tasks are scheduled through DolphinScheduler and the data quality inspection is executed in the CDH data center. After the quality inspection is finished, a quality inspection report is determined, through which the user can view the result of the data quality inspection intuitively. Because the embodiment of the application implements field-level data quality inspection, the quality inspection report at least comprises an overall quality inspection result, an overall quality inspection qualified rate, the qualified rate of each field and a quality inspection problem list.
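By way of illustration only, step S220 can be sketched as checking several fields in parallel and assembling the four report items. The thread pool stands in for the scheduling tool, and the definition of the overall qualified rate (share of all field checks that pass) and the pass threshold are assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, str]
Rule = Callable[[str], bool]


def check_field(rows: List[Row], field: str, rule: Rule) -> List[int]:
    """Return the indexes of the rows whose value in `field` fails the rule."""
    return [i for i, row in enumerate(rows) if not rule(row.get(field, ""))]


def build_report(rows: List[Row], field_rules: Dict[str, Rule], threshold: float = 1.0) -> dict:
    """Inspect each field in parallel and summarise the results in one report."""
    with ThreadPoolExecutor() as pool:
        futures = {
            field: pool.submit(check_field, rows, field, rule)
            for field, rule in field_rules.items()
        }
        failures = {field: future.result() for field, future in futures.items()}

    total = len(rows)
    field_rates = {
        field: (1 - len(bad) / total) if total else 1.0
        for field, bad in failures.items()
    }
    checks = total * len(field_rules)
    failed = sum(len(bad) for bad in failures.values())
    overall_rate = (1 - failed / checks) if checks else 1.0
    problems = [{"row": i, "field": field} for field, bad in failures.items() for i in bad]
    return {
        "passed": overall_rate >= threshold,
        "overall_qualified_rate": overall_rate,
        "field_qualified_rates": field_rates,
        "problems": problems,
    }
```

The problem list records the row and the field of every failure, which is what allows a problematic field to be located rather than only a pass or fail verdict being reported.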
Through steps S200 to S220, the embodiment of the present application provides a data quality management method, which is applied to the data quality management system, and is implemented by determining, according to a data source of a data center, synchronous data, and then determining, according to a service platform and an ETL platform, a quality inspection service, which at least includes a quality inspection field and a quality inspection rule, performing field-level quality inspection on the synchronous data, and determining a quality inspection report. According to the embodiment of the application, the quality inspection report is generated through field-level quality inspection of the synchronous data, the data field with problems can be accurately positioned, the reason that the quality inspection cannot pass is conveniently analyzed according to the data field with problems, and the quality of the data quality inspection is effectively improved.
Referring to fig. 3, fig. 3 is a second schematic diagram of a data quality management system architecture provided in the embodiment of the present application, and it should be noted that fig. 3 uses the same reference numerals as those used in fig. 1 to refer to the same modules, and both fig. 1 and fig. 3 use reference numeral 110 to refer to a service platform. In addition, fig. 3 also includes an automatic quality inspection module 310 and a third party quality inspection module 320. The service platform comprises a quality inspection module 111, a rule management module 112, a file management module 113 and a report management module 114, wherein the quality inspection module is used for initiating a quality inspection task; the rule management module is used for configuring field-level quality inspection rules; the file management module is used for managing the basis files; the report management module is used for managing data quality inspection reports.
The file management module is used for managing the basis files. A basis file is a business basis or a document uploaded by a user; the document may be an internal rule file of an enterprise or a file issued by an official body. The user can set corresponding quality inspection rules according to the basis files. Illustratively, the document may be a "national name standard" issued by the public security authority. The file management module implements whole-process management of the basis files' basic information, one-to-many attachment uploads and operation logs. The managed operations include, but are not limited to, searching, adding, editing, deleting, viewing and uploading basis files, as well as recording operation logs, obtaining the number of records for each time search option, and the like. When a basis file is uploaded, its basic information needs to be completed; the basic information fields comprise the file name, file number, release unit, file type and file effective time. Because basis files must not be duplicated, the user may set a file numbering rule, for example determining the file number as "YJA" + a 6-digit year and month + a 4-digit sequence number. Illustratively, when the file management module implements the search function for basis files, the search is performed on at least one of the fields file number, file name, release unit, file category and creation time. Taking the creation time as an example, it may include 5 options: all, last week, last month, last three months and user-defined; the service platform then displays the basis file records that match the creation time field. It can be understood that the file management module may record an operation log, whose content at least includes operations such as adding, modifying and deleting basis files.
Moreover, the rule management module is used for configuring field-level quality inspection rules, which means that a user can set the fields to be inspected through the rule management module of the service platform without configuring the bottom layer. The rule management module implements whole-process management of the quality inspection rules' basic information, one-to-many basis files and operation logs. The managed operations include, but are not limited to, searching, adding, editing, deleting, viewing and auditing quality inspection rules, adding rule bases, binding underlying rules of the ETL platform, downloading the current list, recording operation logs, obtaining the number of records for each time search option, and the like. When a quality inspection rule is formulated, its basic information fields need to be completed, including but not limited to the rule name, rule number, basis type, rule category, corresponding underlying rule and rule description. The basis type indicates which kind of basis file the rule was set for, and therefore covers rules based on documents and rules based on business bases. The underlying rules comprise five types: null value verification, format verification, maximum/minimum value verification, value range verification and record number verification. The rule categories are the rule types corresponding to different quality inspection angles. Table 1 below is a rule type table provided in the embodiment of the present application; as shown in table 1, the data quality management system in the embodiment of the present application designs rules of different types from the six angles of timeliness, integrity, consistency, accuracy, uniqueness and rationality, thereby implementing 11 types of data quality rules and covering the standards in 206 national documents.
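By way of illustration only, the numbering rule given above ("YJA" + 6-digit year and month + 4-digit sequence number) can be sketched as follows. The function names are hypothetical, and the choice to continue from the highest sequence number already used in the same month is an assumption of this sketch.

```python
from datetime import date
from typing import Iterable


def make_number(prefix: str, issued: date, sequence: int) -> str:
    """Build prefix + YYYYMM + 4-digit sequence number."""
    if not 1 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    return f"{prefix}{issued:%Y%m}{sequence:04d}"


def next_number(existing: Iterable[str], prefix: str, issued: date) -> str:
    """Pick the next unused sequence number for the given prefix and month."""
    stem = f"{prefix}{issued:%Y%m}"
    used = [int(number[len(stem):]) for number in existing if number.startswith(stem)]
    return make_number(prefix, issued, max(used, default=0) + 1)
```

Because the sequence restarts for each prefix and month, two files created in the same month never receive the same number as long as the set of existing numbers is up to date.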
[Table 1 is reproduced as images in the original publication; its contents are not available in this text.]
TABLE 1
In the embodiment of the present application, quality inspection rules must not be duplicated, and the user may design a rule numbering rule, for example determining the rule number as "GZA" + a 6-digit year and month + a 4-digit sequence number. The rule management module is also used for one-to-many management of quality inspection rules and basis files, that is, one or more basis files can be added and bound while the basic information of a quality inspection rule is being added or edited. It can be understood that the rule management module may record an operation log, whose content at least includes operations such as adding, modifying and deleting a quality inspection rule, and in addition includes the start and stop times of the quality inspection rule, the audit result, the log of basis file bindings, and the like. Illustratively, the national standard of the People's Republic of China "Citizen identification number" (GB 11643-1999) describes the citizen identification number as follows: the citizen identification number is a feature combination code consisting of a seventeen-digit body code and a one-digit check code, arranged from left to right as a six-digit address code, an eight-digit date-of-birth code, a three-digit sequence code and a one-digit check code.
Therefore, when a user uses the data quality management system of the embodiment of the application to inspect citizen identification numbers, different inspection rules can be applied to different fields on the service platform, for example a length check on the eight-digit date-of-birth code and a non-null check on the one-digit check code. The user can configure field-level quality inspection rules simply and quickly through the rule management module, thereby achieving fast quality inspection of the data.
In addition, the report management module is used for managing data quality inspection reports: after quality inspection has been completed according to the quality inspection rules of all quality inspection fields, a quality inspection report containing results such as the qualified rate and the list of failures is output and managed. The quality inspection report at least comprises an overall quality inspection result, an overall quality inspection qualified rate, the qualified rate of each field and a quality inspection problem list. When the quality inspection report is generated, its basic information fields need to be completed, including but not limited to the quality inspection list number, the unit to which the data belongs, the data type, the quality inspection mechanism, the qualified rate, the detection unit, the contact telephone of the detection unit and the quality inspection list. The quality inspection list comprises information item names, quality inspection rules, quality inspection bases and quality inspection qualified rates. Illustratively, when the report management module implements the search function for quality inspection reports, the search is performed on at least one of the fields resource number, resource name, department, data type, quality inspection state and warehousing time. Taking the warehousing time as an example, it may offer 5 options, such as all, the last week, the last month and a user-defined range, and the service platform then displays the quality inspection report records that match the warehousing time field.
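By way of illustration only, the field-level checks mentioned above can be sketched by splitting an identification number into the four parts named in GB 11643-1999 and applying a length check and a non-null check. The function names are hypothetical, the sample values in the usage note are placeholders rather than real numbers, and the check-code arithmetic of the standard is deliberately not included.

```python
from typing import Dict, List


def split_id_number(id_number: str) -> Dict[str, str]:
    """Slice the number into address, date-of-birth, sequence and check codes."""
    return {
        "address_code": id_number[0:6],
        "birth_date_code": id_number[6:14],
        "sequence_code": id_number[14:17],
        "check_code": id_number[17:18],
    }


def inspect_id_number(id_number: str) -> List[str]:
    """Apply one rule per field and return a list of problems (empty if none)."""
    parts = split_id_number(id_number)
    problems: List[str] = []
    # Length check on the eight-digit date-of-birth code.
    if len(parts["birth_date_code"]) != 8:
        problems.append("birth_date_code: length is not 8")
    # Non-null check on the one-digit check code.
    if parts["check_code"] == "":
        problems.append("check_code: empty")
    return problems
```

For the placeholder value `"00000019900307123X"` the function returns an empty list, while a truncated value such as `"00000019900307"` yields one problem for the missing check code.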
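By way of illustration only, searching reports by warehousing time can be sketched as mapping a named option to a time window and filtering on it. The option names, the `warehoused_at` key and the 7-day and 30-day window lengths are assumptions of this sketch; only four options are shown because the text does not list all of them.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

Window = Tuple[Optional[datetime], datetime]


def time_window(option: str, now: datetime, custom: Optional[Window] = None) -> Window:
    """Translate a search option into (start, end); a start of None means unbounded."""
    if option == "all":
        return (None, now)
    if option == "last_week":
        return (now - timedelta(days=7), now)
    if option == "last_month":
        return (now - timedelta(days=30), now)
    if option == "custom":
        if custom is None:
            raise ValueError("a custom option needs an explicit window")
        return custom
    raise ValueError(f"unknown option: {option}")


def filter_reports(
    reports: List[Dict[str, datetime]],
    option: str,
    now: datetime,
    custom: Optional[Window] = None,
) -> List[Dict[str, datetime]]:
    """Keep the reports whose warehousing time falls inside the selected window."""
    start, end = time_window(option, now, custom)
    return [
        report
        for report in reports
        if (start is None or report["warehoused_at"] >= start)
        and report["warehoused_at"] <= end
    ]
```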
Referring to fig. 3, the ETL platform includes a rule definition module 121, a variable management module 122, a detection module 123, and an instance management module 124. The rule definition module is used for predefining quality inspection rules; the variable management module is used for managing time variables, which are used for determining quality inspection periods; the detection module is used for configuring a quality inspection rule, detecting a specific field of the data according to the quality inspection rule, and generating an instance; the instance management module is used for managing instances, which at least comprise quality inspection logs and quality inspection reports.
The rule definition module is used for predefining quality inspection rules; that is, the quality inspection rules that will be needed are predefined from common rule templates or national standards. Once predefined, a quality inspection rule can be referenced directly when rules are configured, without step-by-step configuration, and in actual use it can be applied to the data tables of the actual business.
Furthermore, the variable management module is used for managing time variables. A time variable is a value that changes as time changes. For example, suppose a time variable is defined to mean 3 days before the current date, and the current quality inspection task requires the data from three days before the current date. If the current date is January 4, the date of the data to be inspected is January 1; if the current date changes to January 5, the date of the data to be inspected changes to January 2 under the control of this time variable. Once a variable is defined, it can be used directly in the data quality definition window. The purpose of variables is to enable reuse and periodic invocation of quality inspection rules in flexible and changing business scenarios. For example, in a scenario where data is updated on a schedule and quality inspection is incremental, the range of quality inspection can be dynamically controlled through variables.
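The time-variable example above can be sketched in a few lines. The class name `TimeVariable` and its fields are hypothetical, and the year 2021 in the usage lines is chosen arbitrarily; the source only specifies the month and day.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TimeVariable:
    """A named offset that is resolved against the current date at run time."""
    name: str
    days_before_today: int

    def resolve(self, today: date) -> date:
        # The stored definition never changes; only the resolved value does.
        return today - timedelta(days=self.days_before_today)


# "3 days before the current date", as in the example in the text.
three_days_ago = TimeVariable("three_days_ago", 3)

inspected_on_jan_4 = three_days_ago.resolve(date(2021, 1, 4))  # January 1
inspected_on_jan_5 = three_days_ago.resolve(date(2021, 1, 5))  # January 2
```

Because the variable stores an offset rather than a fixed date, the same rule can be invoked every day and will select a different day's data each time.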
In addition, the detection module is used for configuring quality inspection rules, detecting specific fields of the data according to the quality inspection rules, and generating instances. Unlike the field-level quality inspection rules configured by the user on the service platform, the ETL platform in the embodiment of the present application configures underlying quality inspection rules, which perform a series of inspection operations on specific fields of a specific data set in a table of a data source, and also configures the scheduling or timed tasks of a quality inspection task. One quality inspection rule can be run multiple times, and variables and timed scheduling can be added to a quality inspection task, so that different combinations of quality inspection modes are realized.
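One way to picture an underlying rule that targets a specific field of a specific table and is reused across runs is to render it as a query whose date filter comes from a variable. The sketch below is an assumption-laden illustration, not the embodiment's rule format: the function names `null_check_sql` and `runs_for`, the table and column names, and the idea of a date column are all hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List

# Only plain identifiers are accepted, since they are interpolated into SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def null_check_sql(table: str, field: str, date_column: str, day: date) -> str:
    """Render a null-value check for one field of one table on one day."""
    for ident in (table, field, date_column):
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    return (
        f"SELECT COUNT(*) AS null_count FROM {table} "
        f"WHERE {field} IS NULL AND {date_column} = '{day.isoformat()}'"
    )


def runs_for(table: str, field: str, date_column: str,
             today: date, days_before: Iterable[int]) -> List[str]:
    """Reuse the same rule for several day offsets (one query per run)."""
    return [
        null_check_sql(table, field, date_column, today - timedelta(days=n))
        for n in days_before
    ]
```

The rule text stays the same from run to run; only the resolved date changes, which is what lets a scheduler invoke it repeatedly without reconfiguration.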
In addition, the instance management module is used for managing instances. An instance is generated once a quality inspection task starts to run, and at least comprises a quality inspection log and a quality inspection report. Through the instance, technical personnel can view the quality inspection log and thus conveniently check the specific execution process of the quality inspection. In addition, each quality inspection task has a quality inspection report, and the data quality management system provided by the embodiment of the present application supports downloading an Excel file containing the detailed quality inspection report.
Referring to fig. 3, in some embodiments, the data center includes a data warehouse tool 131, a file storage module 132, and a data synchronization tool 133. The data warehouse tool 131 is used for extracting, converting, and loading data. In the embodiment of the present application, Hive is used as the data warehouse tool. Hive is a data warehouse analysis system built on Hadoop that provides a rich SQL query capability for analyzing data stored in the Hadoop distributed file system: structured data files can be mapped to database tables, and a complete SQL query function is provided. The file storage module 132 is used for storing data. In the embodiment of the present application, HDFS (Hadoop Distributed File System) is used as the file storage module; HDFS provides high-throughput data access and is well suited to large-scale data sets. The data synchronization tool 133 is used for synchronizing data. In the embodiment of the present application, Sqoop is used as the data synchronization tool. Sqoop is an open source tool mainly used for transferring data between Hive and conventional databases (e.g., MySQL, Oracle, Postgres); it can import data from a relational database into the HDFS of Hadoop, or export data from HDFS into a relational database. As can be seen from the above, the data quality management system provided in the embodiment of the present application is designed on a highly available, distributed operation mechanism, and is characterized by high availability, high fault tolerance, and distributed operation.
In some embodiments, as shown in fig. 3, the data quality management system further includes an automatic quality inspection module, which is configured to perform automatic quality inspection on the directory attachment of the data. It can be understood that the automatic quality inspection may be a periodic quality inspection or a quality inspection initiated manually by a user, and that the quality inspection range may be full inspection, sampling inspection, or timed incremental inspection.
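The three inspection ranges named above can be illustrated with a small selection function. This is a sketch under assumptions: the function name `select_records`, the mode strings, the `load_date` key, and the use of in-memory records (instead of tables in the data center) are all hypothetical simplifications.

```python
import random
from datetime import date
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def select_records(
    records: List[Record],
    mode: str,
    *,
    sample_size: int = 0,
    since: Optional[date] = None,
    seed: Optional[int] = None,
    date_key: str = "load_date",
) -> List[Record]:
    """Pick the records to inspect for one of three inspection ranges."""
    if mode == "full":
        # Full inspection: every record is checked.
        return list(records)
    if mode == "sampling":
        # Sampling inspection: a random subset without replacement.
        rng = random.Random(seed)
        return rng.sample(records, min(sample_size, len(records)))
    if mode == "incremental":
        # Timed incremental inspection: only records loaded on or after 'since'.
        if since is None:
            raise ValueError("incremental mode requires 'since'")
        return [r for r in records if r[date_key] >= since]
    raise ValueError(f"unknown mode: {mode}")
```

In an incremental run, the `since` argument would typically be supplied by a time variable, so that each scheduled run covers only newly loaded data.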
In some embodiments, as shown in fig. 3, the data quality management system further includes a third-party quality inspection module: an Application Programming Interface (API) is provided at the upper layer of the system to allow third-party applications to call it and initiate quality inspection requests, so that the system of the embodiment of the present application has good extensibility and is suitable for a wider range of services.
To sum up, the embodiment of the present application provides a data quality management system comprising a service platform, an ETL platform, a scheduling tool, and a data center. A user can initiate a quality inspection service through the service platform, configure quality inspection rules using the ETL platform, and perform quality inspection on data provided by the data center, and the system uses the scheduling tool to schedule quality inspection tasks so as to complete the quality inspection of the data. In addition, the embodiment of the present application also provides a data quality management method, which is applied to the data quality management system. By performing field-level quality inspection on the synchronous data and generating a quality inspection report, the embodiment of the present application can accurately locate the data fields with problems, making it convenient to analyze why the quality inspection failed, and effectively improves the quality of data quality inspection. The data quality management system provided by the embodiment of the present application can cover most national standards, and supports the automatic configuration of quality inspection rules according to business needs, field-level accurate quality inspection of data, and detailed quality inspection reports.
Referring to fig. 4, fig. 4 shows an apparatus provided in an embodiment of the present application, where the apparatus 400 includes at least one processor 410 and at least one memory 420 for storing at least one program; in fig. 4, one processor and one memory are taken as an example.
The processor and the memory may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 4.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Another embodiment of the present application also provides an apparatus that may be used to perform the data quality management method as in any of the above embodiments, e.g., to perform the method steps of fig. 2 described above.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
The embodiment of the present application also discloses a computer storage medium in which a processor-executable program is stored; when executed by a processor, the program implements the data quality management method provided by the present application.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
While the preferred embodiments of the present invention have been described, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are included in the scope of the present invention defined by the claims.

Claims (10)

1. A data quality management system is characterized by comprising a service platform, an ETL platform, a scheduling tool and a data center;
the service platform is used for providing quality inspection service;
the ETL platform is used for configuring and managing quality inspection rules;
the scheduling tool is used for scheduling quality inspection tasks;
the data center is used for managing data sources and executing quality inspection tasks.
2. The data quality management system of claim 1, wherein the ETL platform comprises a rule definition module, a variable management module, a detection module, and an instance management module;
the rule definition module is used for predefining a quality inspection rule;
the variable management module is used for managing time variables, and the time variables are used for determining quality inspection periods;
the detection module is used for configuring the quality inspection rule, detecting a specific field of data according to the quality inspection rule and generating an instance;
the instance management module is used for managing the instances, and the instances at least comprise quality inspection logs and quality inspection reports.
3. The data quality management system of claim 1, wherein the service platform comprises a quality inspection module, a rule management module, a file management module, and a report management module;
the quality inspection module is used for initiating the quality inspection task;
the rule management module is used for configuring field-level quality inspection rules;
the file management module is used for managing a basis file;
the report management module is used for managing the quality inspection report.
4. The data quality management system of claim 1, wherein the data center comprises a data warehouse tool, a file storage module, and a data synchronization tool;
the data warehouse tool is used for extracting, converting and loading data;
the file storage module is used for storing data;
the data synchronization tool is used for synchronizing data.
5. The data quality management system of claim 1, wherein the system further comprises an automatic quality inspection module and a third party quality inspection module;
the automatic quality inspection module is used for performing automatic quality inspection on the directory attachment of the data;
and the third-party quality inspection module is used for initiating quality inspection service by a third-party application program.
6. A data quality management method to which the data quality management system of any one of claims 1 to 5 is applied,
determining synchronous data according to a data source of the data center;
determining quality inspection service according to the service platform and the ETL platform;
according to the quality inspection service, performing quality inspection on the synchronous data and determining a quality inspection report;
the quality inspection service at least comprises a plurality of quality inspection fields and quality inspection rules corresponding to the quality inspection fields.
7. The data quality management method according to claim 6, characterized in that:
the quality inspection report at least comprises an overall quality inspection result, an overall quality inspection qualified rate, a qualified rate of each field and a quality inspection problem list.
8. The data quality management method according to claim 6, characterized in that:
the quality inspection rule at least comprises null value verification, format verification, extreme value (maximum/minimum) verification, value range verification and record number verification.
9. An apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the data quality management method of any one of claims 6-8.
10. A computer storage medium in which a processor-executable program is stored, wherein the processor-executable program, when executed by the processor, is for implementing a data quality management method according to any one of claims 6 to 8.
CN202110401537.9A 2021-04-14 2021-04-14 Data quality management method, system, device and storage medium Pending CN113157676A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110401537.9A CN113157676A (en) 2021-04-14 2021-04-14 Data quality management method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN113157676A true CN113157676A (en) 2021-07-23

Family

ID=76890455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110401537.9A Pending CN113157676A (en) 2021-04-14 2021-04-14 Data quality management method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN113157676A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722352A (en) * 2021-08-31 2021-11-30 航天信息***工程(北京)有限公司 Intelligent data verification method, system and storage medium for reporting and reviewing scheme
CN115718745A (en) * 2023-01-09 2023-02-28 中科金瑞(北京)大数据科技有限公司 Data quality detection method and device based on DAG graph task scheduling

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107958049A (en) * 2017-11-28 2018-04-24 航天科工智慧产业发展有限公司 A kind of quality of data checking and administration system
CN109947746A (en) * 2017-10-26 2019-06-28 亿阳信通股份有限公司 A kind of quality of data management-control method and system based on ETL process
CN111159191A (en) * 2019-12-30 2020-05-15 深圳博沃智慧科技有限公司 Data processing method, device and interface

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination