CN116028489A - Automatic data exploration method and application thereof - Google Patents

Automatic data exploration method and application thereof

Info

Publication number
CN116028489A
Authority
CN
China
Prior art keywords
data
field
dimension
index
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211631787.2A
Other languages
Chinese (zh)
Inventor
李圣权
王增璞
曹鹏寅
叶俊宏
毛云青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202211631787.2A
Publication of CN116028489A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an automated data exploration method and application thereof. Data indexes for the data exploration direction are acquired across four dimensions (an account management index system, a database index system, a data table index system and a data field dimension index system) by configuring MySQL statements that query a data parameter library, a permission library and a server performance library; meanwhile, field-level quality check rules are configured according to specific service requirements, and corresponding parameters are configured for each rule, including information such as preset field length, field type and field name; the fields and rules are matched through a classification algorithm and a multi-label algorithm, the efficiency and applicability of the algorithm are optimized, and the rules are run so that the results of the data exploration report are quantified in terms of data quality dimensions such as integrity, effectiveness and accuracy; finally, the script is packaged and deployed, enabling rapid scheduling and parameter tuning within a project.

Description

Automatic data exploration method and application thereof
Technical Field
The application relates to the technical field of data management, in particular to an automatic data exploration method and application thereof.
Background
Data exploration is usually the first step after data acquisition in big data and consulting services; through data exploration, the current state of data quality can be understood quickly and intuitively, providing a foundation for later data mining, data development and data decision-making.
At present, the field of data management lacks a complete data exploration index system and a universal automated implementation script; most data exploration stops at basic data quality information such as table counts and null rates, which is of little practical use in data management business. Meanwhile, the data exploration and data quality modules are not connected, which diminishes the value of data exploration.
Therefore, an automated data exploration method and application thereof are needed to solve the problems of the prior art.
Disclosure of Invention
The embodiment of the application provides an automated data exploration method and application thereof, addressing the prior-art problems of limited practical use and diminished value of data exploration.
The core technology of the invention mainly comprises acquiring data indexes for the data exploration direction across four dimensions (an account management index system, a database index system, a data table index system and a data field dimension index system) by configuring MySQL statements that query a data parameter library, a permission library and a server performance library; meanwhile, field-level quality check rules are configured according to specific service requirements, and corresponding parameters are configured for each rule, including information such as preset field length, field type and field name; the fields and rules are matched through a classification algorithm and a multi-label algorithm, the efficiency and applicability of the algorithm are optimized, and the rules are run so that the results of the data exploration report are quantified in terms of data quality dimensions such as integrity, effectiveness and accuracy; the script is packaged and deployed, enabling rapid scheduling and parameter tuning within the project.
In a first aspect, the present application provides an automated data exploration method comprising the steps of:
S00, constructing a data exploration index system in a multi-dimensional manner based on the data management service, and splitting the data exploration index system into data exploration indexes in a plurality of dimensions;
the dimensions of the data exploration index at least comprise a database dimension, a table dimension, a field dimension and an account management dimension, wherein the database dimension corresponds to a database dimension index, the table dimension corresponds to a table dimension index, the field dimension corresponds to a field dimension index, and the account management dimension corresponds to an account management dimension index;
S10, configuring parameterized query statements for the corresponding indexes by combining the metadata information stored in each database, traversing all database accounts, database tables and database fields to complete the statistics of the data exploration indexes, and generating a plurality of computer language scripts capable of producing automated data exploration reports;
S20, completing the ETL work of data management by using a data analysis library in the computer language;
S30, according to the six evaluation dimensions of data quality, configuring corresponding data quality check rules under each dimension;
the six evaluation dimensions are integrity, uniqueness, consistency, accuracy, effectiveness and timeliness;
S40, performing multi-label classification on all data quality check rules according to the index data in the field exploration report, setting the number of binary classifiers generated according to the number of data quality dimensions so as to optimize algorithm operation efficiency, and running the quality rule scripts;
S50, the quality rule scripts are packaged and deployed on a server and configured according to service requirements to generate the required data exploration report.
Further, in the step S10, the method specifically includes the following steps:
acquiring database information to be probed, wherein the database information at least comprises a host, a port, a user name, a password and a database name;
the corresponding module is imported through a computer language to be connected with the database;
a function is defined to quickly configure the database to complete the SQL statement query and store the query results.
Further, in the step S10, the method further includes the following steps:
completing the automated python script for the account management dimension indexes, and storing the final account management index exploration results in a table by querying the user table in the MySQL database;
and completing the automated python script for the field dimension indexes, and generating the list of tables to be explored and the MySQL scripts for querying their field information by configuring parameters over the tables and columns tables in the MySQL information_schema library.
Further, in the step S20, the method further includes the following steps:
generating a table file for the data exploration report, and traversing and executing the MySQL scripts to acquire the basic field information in the table file;
traversing and executing according to the table names and field names in table_col_name, using them as the corresponding dynamic parameters, so as to complete the MySQL query scripts of the field dimension index exploration report, and acquiring data indexes of the queried table fields such as maximum value, minimum value, average value, enumeration values, null rate, zero rate, field length, the most frequent value, the number of identically named fields in the library, and joinable library tables;
and resetting the table header, merging the basic table field information with the field dimension index information, and writing them into the table file.
Further, in the step S30, the method specifically includes the following steps:
configuring the data index quality check rules by means of regular expressions and parameterized SQL query statements in combination with business requirements: customizing each rule's check script and rule name, presetting the rule parameters, selecting consistency, accuracy, effectiveness or timeliness as the rule check dimension, and creating a field-rule mapping table;
traversing one of the rule check dimensions for a field name in the field-rule mapping table to match a data index quality check rule;
if the matching is successful, running the corresponding rule script to compute the validity rate; if a single dimension matches a plurality of data quality check rules, the field is stored in a data list to be trained.
Further, in the step S40, the method specifically includes the following steps:
discretizing the continuous numerical fields in the field exploration report so that the labels of the same discrete data index are consistent with the labels preset in the data quality check rules;
setting the number of binary classifiers generated according to the number of data quality dimensions, ensuring that each field in the data list to be trained has one and only one check rule in each quality check dimension;
storing the fields and the matched rule names in the field-rule mapping table;
traversing the matched quality check rule scripts for each field; if the validity rate after executing a quality check rule script does not reach the threshold, returning a string indicating that no suitable rule was matched;
and merging the final field consistency, accuracy, effectiveness and timeliness quality indexes with the integrity and uniqueness indexes in the table file to generate the final data quality report.
Further, in the step S40, the method further includes the following steps:
Completing an automatic python script for reporting the database indexes, and realizing a final database index exploration report;
and finishing the automatic python script for reporting the data table index, merging with the data of the previous step, and finishing and storing the data table index report.
In a second aspect, the present application provides an automated data exploration apparatus comprising:
the data exploration index system construction module is used for constructing a data exploration index system in a multi-dimensional mode based on the data management service and splitting the data exploration index system into data exploration indexes in a plurality of dimensions; the dimensions of the data exploration index at least comprise a database dimension, a table dimension, a field dimension and an account management dimension, wherein the database dimension corresponds to a database dimension index, the table dimension corresponds to a table dimension index, the field dimension corresponds to a field dimension index, and the account management dimension corresponds to an account management dimension index;
the configuration module is used for configuring parameterized query statements for the corresponding indexes by combining the metadata information stored in each database, traversing all database accounts, database tables and database fields to complete the statistics of the data exploration indexes, and generating a plurality of computer language scripts capable of producing automated data exploration reports;
the ETL module is used for completing the ETL work of data management by using a data analysis library in the computer language;
the classification optimization module is used for configuring, according to the six evaluation dimensions of data quality, corresponding data quality check rules under each dimension; the six evaluation dimensions are integrity, uniqueness, consistency, accuracy, effectiveness and timeliness;
the communication deployment module is used for performing multi-label classification on all data quality check rules according to the index data in the field exploration report, setting the number of binary classifiers generated according to the number of data quality dimensions so as to optimize algorithm operation efficiency, and running the quality rule scripts to produce the data quality report;
and the deployment output module is used for packaging and deploying the quality rule script on the server and configuring the quality rule script according to service requirements so as to generate a required data exploration report.
In a third aspect, the present application provides an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the automated data exploration method described above.
In a fourth aspect, the present application provides a readable storage medium having stored therein a computer program comprising program code for controlling a process to execute a process comprising an automated data exploration method according to the above.
The main contributions and innovation points of the invention are as follows: 1. compared with the prior art, the method and the device establish a data exploration index system and integrate data exploration with the data quality check system, filling the data exploration system of each dimension with data indexes according to the service requirements of actual projects; ensuring that each index can be obtained through a short MySQL query statement with low time complexity; and optimizing the multi-label algorithm to reduce its time complexity, complete automatic matching of data quality rule checks in a short time, and generate a multi-dimensional data quality check report;
2. compared with the prior art, a python script for automatically generating the MySQL database data exploration report is formed by packaging the python scripts; for database exploration with a larger data volume, the script can still finish generating the exploration report in a short time, greatly improving the efficiency of early-stage data management work in a project; meanwhile, the similarity between the configured rule names and the field names is measured through the Levenshtein distance, completing automatic quality checks and generating the data quality report.
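For illustration, a minimal pure-python sketch of the Levenshtein-distance similarity measure mentioned above follows; the function names are assumptions, not the patent's actual code:

```python
# Sketch of the rule-name/field-name similarity via Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between strings a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def name_similarity(rule_name: str, field_name: str) -> float:
    # Normalize the distance to a [0, 1] similarity; 1.0 means identical.
    d = levenshtein(rule_name, field_name)
    return 1.0 - d / max(len(rule_name), len(field_name), 1)
```

A configured rule name would then be matched to the field whose similarity score is highest.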
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of an automated data exploration method according to an embodiment of the present application;
FIG. 2 is a simplified flow chart of the data probe report of the present application;
FIG. 3 is a flow chart of the deployment script of the present application;
FIG. 4 is a simplified flow chart of the data quality check of the present application;
fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that in other embodiments the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, an individual step described in this specification may, in other embodiments, be split into multiple steps, while multiple steps described in this specification may be combined into a single step in other embodiments.
At present, the field of data management lacks a complete data exploration index system and a universal automated implementation script; most data exploration stops at basic data quality information such as table counts and null rates, which is of little practical use in data management business. Meanwhile, the data exploration and data quality modules are not connected, which diminishes the value of data exploration.
Based on this, the present invention solves the problems of the prior art by means of python scripts.
Example 1
The method aims to provide a data exploration index system constructed over multiple dimensions of the data management service, split into nearly 40 data indexes across the account authority dimension, database dimension, data table dimension and data field dimension; combining the metadata information that the MySQL InnoDB engine stores in the mysql library, the information_schema library and the performance_schema library, a parameterized query statement is configured for each corresponding index, the data exploration index statistics are traversed over all database accounts, database tables and database table fields, and a python script capable of producing an automated data exploration report is generated; the ETL (Extract-Transform-Load) work of data management, i.e. the process of extracting, transforming and loading data from source to destination, is completed using the python pandas library; meanwhile, data quality is divided into six evaluation dimensions (integrity, uniqueness, timeliness, effectiveness, accuracy and consistency indexes), corresponding data quality check rules are configured under each dimension, multi-label classification of the corresponding quality check rules is completed by a binary relevance method according to the index data in the field exploration report, and the number of binary classifiers generated is set according to the number of data quality dimensions so as to optimize algorithm operation efficiency; the quality rule scripts are run to produce the data quality report; finally, the scripts are packaged and deployed on the server through Docker, periodic scheduling is configured according to service requirements, and the final weekly, monthly, quarterly or yearly data exploration report is generated.
Specifically, the automated data exploration method of the present application, referring to fig. 1 and 2 (abbreviated process), comprises the following steps:
S00, constructing a data exploration index system in a multi-dimensional manner based on the data management service, and splitting the data exploration index system into data exploration indexes in a plurality of dimensions;
in this embodiment, according to the data exploration purpose and the specific service requirement, the data exploration index is divided into four dimensions, which are respectively: database dimensions, table dimensions, field dimensions, and account management dimensions.
The database dimension indexes comprise the number of tables, the number of views, the number of functions and the number of stored procedures;
the table dimension indexes comprise the library name the table belongs to, the table comment, the tablespace size, the row count, the field lengths in the table, tablespace fragmentation, the table character set and collation, and the latest update time;
the field dimension indexes comprise the field name, field comment, data volume, null rate, zero rate, field type, the library and table the field belongs to, maximum/minimum value, average value, unique values, maximum field length, the number of identically named fields in the library, the tables the identically named fields belong to, the most frequent value, and the frequency of the most frequent value;
the account management dimension comprises account authority information (such as query, insert, update, delete, create temporary table, modify table structure, etc.), queries per hour, connections per hour, maximum connections, and other data indexes.
Preferably, the indexes in the data exploration index system are not limited to basic data quality information; several indexes with greater practical value and their main purposes are as follows:
1. database dimension indexes:
maximum number of connections: judging the maximum number of connections the server can bear;
2. table dimension indexes:
latest update time: judging whether the table is still being updated;
tablespace fragmentation: judging whether a large amount of data in the table has been deleted in the short term;
3. field dimension indexes:
number of identically named fields in the library: fields that can serve as join conditions for data development can be quickly found;
tables the identically named fields belong to: tables that can be joined for development can be quickly found;
4. account management dimension indexes:
account rights: the CRUD (create, read, update, delete) permissions of the current account on the database can be quickly known;
queries per hour: whether the account has a query count limit;
connections per hour: whether the account has a connection count limit.
S10, configuring parameterized query statements for the corresponding indexes by combining the metadata information stored in each database, traversing all database accounts, database tables and database fields to complete the statistics of the data exploration indexes, and generating a plurality of computer language scripts capable of producing automated data exploration reports;
In this embodiment, specifically, the method includes:
S11, acquiring the MySQL database information to be explored, including host, port, user name, password and database name; the MySQL database is connected by importing the pymysql module in the python language, and an sql_query function is defined so as to quickly configure the database, complete SQL statement queries, and store the query results;
S12, completing the automated python script for the account management dimension indexes: by executing queries for the account name, the SELECT, INSERT, UPDATE and DELETE command privileges, the privileges to create new databases and tables, delete existing databases and tables, reload internal caches, shut down the MySQL server, grant privileges to other accounts, create and delete table indexes, rename and modify table structures, and create temporary tables, as well as the queries per hour (0 means unlimited), connections per hour (0 means unlimited) and maximum connections (0 means unlimited) data indexes, the final account management exploration results are stored in Excel;
S13, completing the automated python script for the field dimension indexes, and generating the list of tables to be explored and the MySQL scripts for querying their field information by configuring parameters over the tables and columns tables in the MySQL information_schema library;
S20, using a data analysis library in the computer language to complete the ETL (Extract-Transform-Load) work of data management, ETL describing the process of extracting, transforming and loading data from a source end to a destination end;
in this embodiment, the ETL process is specifically:
S21, generating the Excel file of the data exploration report, traversing and executing the MySQL scripts stored in table_sql, and acquiring the basic field information in the tables, such as the field Chinese name, the field English name, the table the field belongs to, and the field type;
S22, traversing and executing according to the table names and field names in table_col_name, using them as the corresponding dynamic parameters, to complete the MySQL query scripts of the field dimension index exploration report, and acquiring data indexes of the queried table fields such as maximum value, minimum value, average value, enumeration values, null rate, zero rate, field length, the most frequent value, the number of identically named fields in the library, and joinable library tables;
S23, resetting the table header, merging the basic table field information with the field dimension index information, and writing them into the prepared Excel exploration report file (which serves as the final data exploration report);
wherein the null rate reflects the integrity dimension of data quality; the number of identically named fields in the database reflects the uniqueness dimension of data quality; and the consistency, accuracy, effectiveness and timeliness of data quality are checked through the data quality rules.
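As a hedged illustration of the S22 metric queries, one per-field statement might look as follows; since identifiers cannot be bound as SQL parameters, the table and field names from table_col_name are formatted in as dynamic parameters (assumed to have been validated against information_schema first):

```python
# One field-dimension metric query; `db`, `tbl` and `col` are dynamic
# parameters taken from table_col_name, not user input.
METRIC_SQL = """
SELECT MAX(`{col}`)                    AS max_val,
       MIN(`{col}`)                    AS min_val,
       AVG(`{col}`)                    AS avg_val,
       SUM(`{col}` IS NULL) / COUNT(*) AS null_rate,
       SUM(`{col}` = 0) / COUNT(*)     AS zero_rate,
       MAX(CHAR_LENGTH(`{col}`))       AS max_field_len
FROM `{db}`.`{tbl}`
"""

row = sql_query(conn, METRIC_SQL.format(db=TARGET_DB, tbl="orders",
                                        col="amount"))  # placeholder names
```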
S30, according to six evaluation dimensions of data quality, configuring corresponding data quality check rules under each dimension;
the six evaluation dimensions are integrity, uniqueness, consistency, accuracy, effectiveness and timeliness;
in this embodiment, specifically, the method includes:
S31, configuring the data index quality check rules by combining regular expressions and parameterized SQL query statements with the business requirements: each rule's check script and rule name are customized; information such as the rule's applicable field type, data volume, null rate, zero rate, maximum field length and most frequent value is preset (continuous values are adjusted to discrete values); consistency, accuracy, effectiveness or timeliness is selected as the rule check dimension; and a field-rule mapping table is created so that historical matching results can be queried quickly, improving the operating efficiency of the device;
S32, quickly traversing one of the dimensions (consistency, accuracy, effectiveness or timeliness) for a field name in the field-rule mapping table, and matching a data quality check rule;
if the matching is successful, the corresponding rule script is run directly to compute the validity rate; if a plurality of corresponding rules exist in a single dimension (matching fails), the field is stored in the data list to be trained;
wherein "a plurality of corresponding rules in a single dimension" means that, in the historical matching results, the current field may be matched to multiple rules within a single dimension among integrity, uniqueness, consistency, accuracy, effectiveness and timeliness; since the service requires that only one rule be checked per quality dimension for each field, when a field is associated with multiple rules in the same dimension it is stored in the data list to be trained (i.e. the field list to be matched in fig. 4), and the classification algorithm in fig. 4 is run.
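A minimal sketch of the S31/S32 rule configuration and mapping lookup; the rule structure, names and example pattern are illustrative assumptions:

```python
import re

# One illustrative rule in the "effectiveness" check dimension:
# a regex check for Chinese mobile-phone-number fields.
RULES = {
    "phone_format": {
        "dimension": "effectiveness",
        "pattern": re.compile(r"^1\d{10}$"),
    },
}

# Historical field-rule mapping table: (field, dimension) -> rule names.
field_rule_map = {("user_phone", "effectiveness"): ["phone_format"]}

def match_rule(field_name: str, dimension: str):
    # S32: consult the historical mapping for this field and dimension.
    hits = field_rule_map.get((field_name, dimension), [])
    if len(hits) == 1:
        return hits[0]   # success: run its script and compute validity rate
    return None          # zero or multiple hits: add field to to-train list
```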
S40, performing multi-label classification on all data quality check rules according to the index data in the field exploration report, setting the number of binary classifiers generated according to the number of data quality dimensions so as to optimize algorithm operation efficiency, and running the quality rule scripts;
In this embodiment, specifically, the method includes:
S41, discretizing the continuous numerical fields in the data exploration report, so that the labels of the same discrete data index are consistent with the labels preset by the rule checks;
S42, since a single field needs to be quantified over the four quality dimensions of consistency, accuracy, effectiveness and timeliness by the data quality check scripts, the number of data quality check rules a field can match is 4, with at most one check rule matched per dimension; the traditional binary relevance method cannot limit the number of categories produced by the classifiers, and it splits all n data quality rules into n data sets, each containing every field in the data list to be trained, so that a classifier is trained for every field and every check rule, which becomes far too inefficient when the data volume is large; according to the specific data quality check service requirements, this embodiment dynamically updates the data list to be trained, with the specific scheme as follows:
ls_index (to-be-trained field list) = [I1, I2, I3, …, In]
ls_rules = [[R11, R12, R13, …, R1m],
[R21, R22, R23, …, R2m],
[R31, R32, R33, …, R3m],
[R41, R42, R43, …, R4m]]
where ls_rules is the rule list; in Rnm, n denotes the quality dimension the rule belongs to and m denotes the rule's position within that dimension's rule list.
ls_rules is decomposed into m sub-data sets, each containing every field in ls_index; the sub-data sets are as follows:
ls_comb_n_m (sub-data set nm; the remaining sub-data sets have the same structure)
= [[I1, (-)Enm], [I2, (-)Enm], …, [In, (-)Enm]] (n in [1, 2, 3, 4], m in [1, 2, 3, …, len(ls_rules[n])])
The presence of (-) indicates that, according to the result of the classification algorithm, the field does not match rule Rnm in that dimension's rule list; if (-) is absent, the field matches rule Rnm. The optimization point is that once a field In has been successfully matched to some Rnm, all remaining binary classifiers En that combine In with rules of dimension n are temporarily deleted, i.e. no further binary classifier training is performed for In on the quality check rules of that dimension; the input of the classification algorithm is the index items in the field-level data exploration report, and the output is the matching result with the rule (a binary result).
Here Enm is the result label of the classification algorithm, and "-" indicates that field In does not match rule Rnm; the result labels are stored in a list, and once a field has matched its first Enm in each dimension, classification of In terminates and the algorithm proceeds to the next field.
Here len(ls_rules[n]) denotes the length of ls_rules[n], i.e. m ranges from 1 to len(ls_rules[n]).
This optimization ensures that each field has exactly one check rule in each quality check dimension and greatly improves computational efficiency; the final result of the optimized binary relevance method is as follows:
ls_output = [[I1, (R14, R25, R35, R44)],
[I2, (R13, R24, R35, R45)],
[I3, (R14, R27, R39, R49)],
[I4, (R11, R23, R35, R47)],
…,
[In, (R12, R24, R36, R48)]]
As can be seen, the conventional binary relevance method generates a result label for every field and every rule, so its computational time complexity is very high: the number of classifier runs = the number of fields × the number of rules; after the optimization, the algorithm proceeds to the next field as soon as the current field has matched a rule in each dimension, so the runtime is positively correlated with the number of fields rather than with fields × rules.
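The optimized matching loop can be sketched as follows; the data layout follows ls_index/ls_rules above, and classify() stands in for the trained binary classifiers, whose training is omitted as an assumption:

```python
# Optimized binary relevance: per field, stop scanning a dimension's rule
# list at the first positive match, then move to the next dimension.
def match_fields(ls_index, ls_rules, classify):
    ls_output = []
    for field in ls_index:
        matched = []
        for dim_rules in ls_rules:       # the 4 quality check dimensions
            for rule in dim_rules:
                if classify(field, rule):
                    matched.append(rule)
                    break                # remaining classifiers En deleted
        ls_output.append([field, tuple(matched)])
    return ls_output
```

Runtime is thus bounded by fields × dimensions × rules scanned before the first match, rather than fields × total rules.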
S43, traversing the four matched quality check rule scripts for each field; if the validity rate after execution does not reach the threshold, a string indicating that no suitable rule was matched is returned; the final field consistency, accuracy, effectiveness and timeliness quality indexes are merged with the integrity and uniqueness indexes in the data exploration table to generate the final data quality report (the quality inspection report in fig. 4);
S44, storing the fields and the matched rule names in the field-rule mapping table, to facilitate quick script scheduling later;
S45, completing the automated python script for the database index report: the maximum connections index of the database is counted by executing a MySQL statement querying max_connections, the data is cleaned, and a schema_info field is added as the join condition for subsequent data merging;
S46, executing MySQL query statements over the tables and views tables in the information_schema library and the procs_priv table in the mysql library to complete the statistics of the database's table count, view count, function count and stored procedure count indexes, and merging with the database maximum connections result via the added schema_name field to produce the final database index exploration report (output as an independent Excel file comprising multiple sheet pages, each page being the data index exploration report of an independent dimension);
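A hedged sketch of the S45/S46 statistics; the SQL targets standard information_schema tables and SHOW VARIABLES, while the merge key and column labels are assumptions (pd and sql_query come from the earlier sketches):

```python
DB_OBJ_SQL = (
    "SELECT table_schema, "
    "SUM(table_type = 'BASE TABLE') AS table_cnt, "
    "SUM(table_type = 'VIEW') AS view_cnt "
    "FROM information_schema.tables GROUP BY table_schema")
ROUTINE_SQL = (
    "SELECT routine_schema, "
    "SUM(routine_type = 'FUNCTION') AS function_cnt, "
    "SUM(routine_type = 'PROCEDURE') AS procedure_cnt "
    "FROM information_schema.routines GROUP BY routine_schema")

objs = pd.DataFrame(sql_query(conn, DB_OBJ_SQL),
                    columns=["schema_name", "table_cnt", "view_cnt"])
routines = pd.DataFrame(sql_query(conn, ROUTINE_SQL),
                        columns=["schema_name", "function_cnt",
                                 "procedure_cnt"])
max_conn = sql_query(conn, "SHOW VARIABLES LIKE 'max_connections'")[0][1]

db_report = objs.merge(routines, on="schema_name", how="left")
db_report["max_connections"] = max_conn  # join condition per S45/S46
```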
S47, completing the automated python script for the data table index report: the table record count, table field lengths, tablespace fragmentation and table character set/collation are counted by executing MySQL statements querying the tables table in the information_schema library, and a join field is generated by concatenating the database name and the data table name;
S48, the table file size and actual file size are counted by executing queries against the innodb tables in the information_schema library, a join field is generated by concatenating the database name and the data table name, and the result is joined and merged with the data index table from the preceding step to finalize and store the data table index exploration report;
these two python files supplement the indexes above (the other two python scripts, or called files, cover similar content, so four index tables are generated in total); since the two data table index dimensions, table file size and actual file size, are computed from different sources, they must be joined on the unique key (database name + data table name) before being merged and stored in the data table index exploration report.
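A sketch of the S47/S48 join on the concatenated (database name + data table name) key; the second metric source is stubbed with placeholder figures:

```python
meta = pd.DataFrame(
    sql_query(conn,
              "SELECT table_schema, table_name, table_rows, data_length "
              "FROM information_schema.tables WHERE table_schema = %s",
              (TARGET_DB,)),
    columns=["db", "table", "rows", "data_length"])
meta["join_key"] = meta["db"] + "." + meta["table"]

# Placeholder for the second source (table file size / actual file size).
file_sizes = pd.DataFrame({"join_key": ["demo_db.orders"],
                           "file_size": [1048576], "actual_size": [917504]})

table_report = meta.merge(file_sizes, on="join_key", how="left")
table_report.to_excel("data_table_index_report.xlsx", index=False)
```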
Preferably, fig. 4 shows the data quality check procedure, presenting step S40 simply and clearly. The data exploration report is the Excel exploration report file produced after the ETL of step S20; the rule base is the base formed by the data index quality check rules;
S50, the quality rule scripts are packaged and deployed on a server and configured according to service requirements to generate the required data exploration report.
In this embodiment, as shown in fig. 3, specifically:
S51, a requirements.txt file is generated and the final scripts (the required python components, such as the database connection script, the data index calculation script, the data quality check script, the rule-field classification script and the data exploration report merging script) are consolidated into the package, completing packaging;
S52, a Dockerfile is created, the base image source is defined, the main file name of the run command is configured, and the python packages to be imported are written into requirements.txt;
S53, the packaged folder is uploaded to the server through Docker and an image is created; that is, connect to the server via docker, upload the folder containing the .py files, requirements.txt and the Dockerfile, and create the image with sudo docker build -t <image name>;
S54, the .py file is run in a container generated on the client's server, completing the application deployment of the automated script; that is, the container is generated from the image by the run command and the parameters are configured in config().
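For illustration, a minimal Dockerfile consistent with steps S51 to S54 might look as follows; the base image, file names and image name are assumptions rather than values specified by this application:

```dockerfile
# Illustrative Dockerfile for the packaged exploration scripts (S51-S54);
# base image and entry file name are assumptions.
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt   # install the packaged python libraries
COPY . .
CMD ["python", "main.py"]             # run the main exploration script
```

The image would then be built and run on the server with, for example, sudo docker build -t probe-image . followed by docker run probe-image.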
Example two
Based on the same conception, the application also provides an automatic data exploration device, which comprises:
the data exploration index system construction module is used for constructing a data exploration index system in a multi-dimensional mode based on the data management service and splitting the data exploration index system into data exploration indexes in a plurality of dimensions; the dimensions of the data exploration index at least comprise a database dimension, a table dimension, a field dimension and an account management dimension, wherein the database dimension corresponds to a database dimension index, the table dimension corresponds to a table dimension index, the field dimension corresponds to a field dimension index, and the account management dimension corresponds to an account management dimension index;
the configuration module is used for configuring parameterized query statements for the corresponding indexes by combining the metadata information stored in each database, traversing all database accounts, database tables and database fields to complete the statistics of the data exploration indexes, and generating a plurality of computer language scripts capable of producing automated data exploration reports;
the ETL module is used for completing the ETL work of data management by using a data analysis library in the computer language;
the classification optimization module is used for configuring, according to the six evaluation dimensions of data quality, corresponding data quality check rules under each dimension; the six evaluation dimensions are integrity, uniqueness, consistency, accuracy, effectiveness and timeliness;
the communication deployment module is used for performing multi-label classification on all data quality check rules according to the index data in the field exploration report, setting the number of binary classifiers generated according to the number of data quality dimensions so as to optimize algorithm operation efficiency, and running the quality rule scripts to produce the data quality report;
and the deployment output module is used for packaging and deploying the quality rule script on the server and configuring the quality rule script according to service requirements so as to generate a required data exploration report.
Example III
This embodiment also provides an electronic device, referring to fig. 5, comprising a memory 404 and a processor 402, the memory 404 having stored therein a computer program, the processor 402 being arranged to run the computer program to perform the steps of any of the method embodiments described above.
In particular, the processor 402 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
The memory 404 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, the memory 404 may comprise a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 404 may include removable or non-removable (or fixed) media, where appropriate. The memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 404 is non-volatile memory. In particular embodiments, the memory 404 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically rewritable ROM (EAROM) or flash memory, or a combination of two or more of these. The RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode DRAM (FPM DRAM), extended data out DRAM (EDO DRAM), synchronous DRAM (SDRAM), and the like, as appropriate.
Memory 404 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions for execution by processor 402.
The processor 402 implements any of the automated data probing methods of the above embodiments by reading and executing computer program instructions stored in the memory 404.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402 and the input/output device 408 is connected to the processor 402.
The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission device 406 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
The input-output device 408 is used to input or output information.
Example IV
The present embodiment also provides a readable storage medium having stored therein a computer program comprising program code for controlling a process to execute the process, the process comprising the automated data exploration method according to the first embodiment.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets, and/or macros can be stored in any apparatus-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. In addition, in this regard, it should be noted that any blocks of the logic flows as illustrated may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on a physical medium such as a memory chip or memory block implemented within a processor, a magnetic medium such as a hard disk or floppy disk, and an optical medium such as, for example, a DVD and its data variants, a CD, etc. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered within the scope of this description.
The foregoing examples merely represent several embodiments of the present application, the description of which is more specific and detailed and which should not be construed as limiting the scope of the present application in any way. It should be noted that variations and modifications can be made by those skilled in the art without departing from the spirit of the present application, which falls within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. An automated data exploration method, comprising the steps of:
S00, constructing a data exploration index system in a multi-dimensional manner based on the data management service, and splitting the data exploration index system into data exploration indexes in a plurality of dimensions;
the dimensions of the data exploration indexes at least comprise a database dimension, a table dimension, a field dimension and an account management dimension, wherein the database dimension corresponds to a database dimension index, the table dimension corresponds to a table dimension index, the field dimension corresponds to a field dimension index, and the account management dimension corresponds to an account management dimension index;
S10, configuring parameterized query statements for the corresponding indexes by combining the metadata information stored in each database, traversing all database accounts, database tables and database fields to complete the statistics of the data exploration indexes, and generating a plurality of computer language scripts capable of producing automated data exploration reports;
S20, completing the ETL work of data management by using a data analysis library in the computer language;
S30, according to the six evaluation dimensions of data quality, configuring corresponding data quality check rules under each dimension;
the six evaluation dimensions are integrity, uniqueness, consistency, accuracy, effectiveness and timeliness;
S40, performing multi-label classification on all data quality check rules according to the index data in the field exploration report, setting the number of binary classifiers generated according to the number of data quality dimensions so as to optimize algorithm operation efficiency, and running the quality rule scripts;
S50, the quality rule scripts are packaged and deployed on a server and configured according to service requirements to generate the required data exploration report.
2. An automated data exploration method according to claim 1, wherein in step S10, the method comprises the steps of:
Acquiring database information to be probed, wherein the database information at least comprises a host, a port, a user name, a password and a database name;
the corresponding module is imported through a computer language to be connected with the database;
a function is defined to quickly configure the database to complete the SQL statement query and store the query results.
3. An automated data exploration method according to claim 1, wherein in step S10, further comprising the steps of:
completing an automated python script of the account management dimension index, and storing a final account management index exploration result in a table by inquiring a user table in a MySQL database;
and (3) completing an automatic python script for the field dimension index, and generating a list to be probed and a MySQL script for inquiring the field information of the list by carrying out parameter configuration on the tables and columns in the MySQL database information_schema library.
4. An automated data exploration method according to claim 3, wherein in step S20, further comprising the steps of:
generating a table file of the data exploration report, and traversing and executing the MySQL scripts to acquire the basic field information in the table file;
traversing and executing according to the table names and field names in table_col_name, using them as the corresponding dynamic parameters, so as to complete the MySQL query scripts of the field dimension index exploration report, and acquiring data indexes of the queried table fields such as maximum value, minimum value, average value, enumeration values, null rate, zero rate, field length, the most frequent value, the number of identically named fields in the library, and joinable library tables;
and resetting the table header, merging the basic table field information with the field dimension index information, and writing them into the table file.
5. The automated data probing method of claim 4, wherein in step S30, the method specifically comprises the steps of:
configuring the data index quality check rules by means of regular expressions and parameterized SQL query statements in combination with business requirements: customizing each rule's check script and rule name, presetting the rule parameters, selecting consistency, accuracy, effectiveness or timeliness as the rule check dimension, and creating a field-rule mapping table;
traversing one of the rule check dimensions for a field name in the field-rule mapping table to match a data index quality check rule;
if the matching is successful, running the corresponding rule script to compute the validity rate; if a single dimension matches a plurality of data quality check rules, the field is stored in a data list to be trained.
6. An automated data probing method according to claim 5, wherein in step S40, the method specifically comprises the steps of:
discretizing the continuous numerical fields in the field exploration report so that the labels of the same discrete data index are consistent with the labels preset in the data quality check rules;
setting the number of binary classifiers generated according to the number of data quality dimensions, ensuring that each field in the data list to be trained has one and only one check rule in each quality check dimension;
storing the fields and the matched rule names in the field-rule mapping table;
traversing the matched quality check rule scripts for each field; if the validity rate after executing a quality check rule script does not reach the threshold, returning a string indicating that no suitable rule was matched;
and merging the final field consistency, accuracy, effectiveness and timeliness quality indexes with the integrity and uniqueness indexes in the table file to generate the final data quality report.
7. An automated data exploration method according to claim 6, wherein in step S40, further comprising the steps of:
completing an automatic python script for reporting the database indexes, and realizing a final database index exploration report;
and finishing the automatic python script for reporting the data table index, merging with the data of the previous step, and finishing and storing the data table index report.
8. An automated data exploration apparatus, comprising:
the data exploration index system construction module is used for constructing a data exploration index system in a multi-dimensional mode based on the data management service and splitting the data exploration index system into data exploration indexes in a plurality of dimensions; the dimensions of the data exploration index at least comprise a database dimension, a table dimension, a field dimension and an account management dimension, wherein the database dimension corresponds to a database dimension index, the table dimension corresponds to a table dimension index, the field dimension corresponds to a field dimension index, and the account management dimension corresponds to an account management dimension index;
the configuration module is used for configuring parameterized query statements for the corresponding indexes by combining the metadata information stored in each database, traversing all database accounts, database tables and database fields to complete the statistics of the data exploration indexes, and generating a plurality of computer language scripts capable of producing automated data exploration reports;
the ETL module is used for completing the ETL work of data management by using a data analysis library in the computer language;
the classification optimization module is used for configuring, according to the six evaluation dimensions of data quality, corresponding data quality check rules under each dimension; the six evaluation dimensions are integrity, uniqueness, consistency, accuracy, effectiveness and timeliness;
the communication deployment module is used for performing multi-label classification on all data quality check rules according to the index data in the field exploration report, setting the number of binary classifiers generated according to the number of data quality dimensions so as to optimize algorithm operation efficiency, and running the quality rule scripts to produce the data quality report;
and the deployment output module is used for packaging and deploying the quality rule script on the server and configuring the quality rule script according to service requirements so as to generate a required data exploration report.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the automated data exploration method of any of claims 1 to 8.
10. A readable storage medium, characterized in that the readable storage medium has stored therein a computer program comprising program code for controlling a process to execute a process comprising an automated data exploration method according to any of claims 1 to 8.
CN202211631787.2A 2022-12-19 2022-12-19 Automatic data exploration method and application thereof Pending CN116028489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211631787.2A CN116028489A (en) 2022-12-19 2022-12-19 Automatic data exploration method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211631787.2A CN116028489A (en) 2022-12-19 2022-12-19 Automatic data exploration method and application thereof

Publications (1)

Publication Number Publication Date
CN116028489A true CN116028489A (en) 2023-04-28

Family

ID=86080491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211631787.2A Pending CN116028489A (en) 2022-12-19 2022-12-19 Automatic data exploration method and application thereof

Country Status (1)

Country Link
CN (1) CN116028489A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151053A (en) * 2023-11-01 2023-12-01 东莞信宝电子产品检测有限公司 Report automation realization method, system and medium
CN117151053B (en) * 2023-11-01 2024-02-13 东莞信宝电子产品检测有限公司 Report automation realization method, system and medium
CN117273552A (en) * 2023-11-22 2023-12-22 山东顺国电子科技有限公司 Big data intelligent treatment decision-making method and system based on machine learning
CN117273552B (en) * 2023-11-22 2024-02-13 山东顺国电子科技有限公司 Big data intelligent treatment decision-making method and system based on machine learning

Similar Documents

Publication Publication Date Title
CN116028489A (en) Automatic data exploration method and application thereof
CN110069449B (en) File processing method, device, computer equipment and storage medium
US11526799B2 (en) Identification and application of hyperparameters for machine learning
CN109558440A (en) Batch data processing method, device, computer equipment and storage medium
US11238040B2 (en) Grouping in analytical databases
US9141665B1 (en) Optimizing search system resource usage and performance using multiple query processing systems
US10915533B2 (en) Extreme value computation
CN113312336A (en) Data processing method, data processing apparatus, electronic device, storage medium, and program product
US20210019804A1 (en) Systems and methods for generating synthetic data
CN113282630A (en) Data query method and device based on interface switching
CN114218263B (en) Materialized view automatic creation method and materialized view based quick query method
CN113535766B (en) Workflow configuration method, device, electronic device and storage medium
CN108038253B (en) Log query processing method and device
US20180232145A1 (en) Method of storing data, information processing apparatus and non-transitory computer-readable storage medium
US20230153286A1 (en) Method and system for hybrid query based on cloud analysis scene, and storage medium
US8005844B2 (en) On-line organization of data sets
US20200117638A1 (en) Method, device and computer program product for searching a file
CN116049509A (en) Data query method, device, equipment and medium based on regular matching
CN115658680A (en) Data storage method, data query method and related device
CN115794861A (en) Offline data query multiplexing method based on feature abstract and application thereof
CN114817218A (en) Data verification method and system, electronic device and readable storage medium
US20220245097A1 (en) Hashing with differing hash size and compression size
CN113254349A (en) AB test processing method, device, equipment and storage medium based on cloud function
CN111399759A (en) Method for reading data and writing data and object file system
CN117931248B (en) Method and device for improving deployment efficiency of boulder application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination