CN107798007B

CN107798007B - Distributed database data verification method, device and related device

Info

Publication number: CN107798007B
Application number: CN201610794307.2A
Authority: CN
Inventors: 郭龙波; 丁岩; 徐宜良; 张宗禹; 林周凯
Original assignee: Jinzhuan Xinke Co Ltd
Current assignee: Jinzhuan Xinke Co Ltd
Priority date: 2016-08-31
Filing date: 2016-08-31
Publication date: 2024-03-19
Anticipated expiration: 2036-08-31
Also published as: CN107798007A

Abstract

The invention discloses a method, a device and a related device for verifying online data of a distributed database. Consistency of the data to be changed before and after the change is determined by comparing whether the check values of specified row data in the data to be changed are consistent before and after import, thereby effectively solving the problem that distributed databases in the prior art cannot determine the consistency of data before and after a change.

Description

Distributed database data verification method, device and related device

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a method and an apparatus for checking distributed database data, and a related apparatus.

Background

With the wide application of database technology and the continuous accumulation of online service data, and especially with the rapid development of internet services, data volumes keep growing and the performance of a single database has become a bottleneck for online services. A distributed database can provide high-performance, large-capacity and high-concurrency database services, so distributed databases have been rapidly adopted in a variety of online service scenarios.

However, when an existing distributed database performs data migration or data initialization, the consistency of the data before and after the change cannot be determined, which limits the application range of distributed databases.

Disclosure of Invention

The invention provides a method, a device and a related device for checking data of a distributed database, which are used for solving the problem that the distributed database in the prior art cannot determine the consistency of the data before and after the data is changed.

In one aspect, the invention provides a method for checking data of a distributed database, which comprises the following steps:

exporting the data to be changed into a data description text, and calculating, from the exported data description text, a check value of specified row data in the data to be changed;

splitting the data to be changed by row, and importing the split data to be changed into the corresponding database nodes;

after the data import is completed, calculating the check value of the specified row data in the imported data to be changed, comparing whether the check values of the specified row data are consistent before and after the import, and if so, determining that the data to be changed is consistent before and after the change.

Further, the calculating the check value of the specified row data in the data to be changed specifically includes:

calculating the check value of a single specified row of the data to be changed, or calculating the sum of the check values of one or more specified groups of N consecutive rows of the data to be changed.

Further, when the specified row data is a single row, calculating the check value of the specified row data after the import specifically includes: calculating the check value of the specified single row in the imported data to be changed; and comparing whether the check values of the specified row data are consistent specifically includes: comparing the check values of the specified single row before and after the import;

when the specified row data is one or more groups of N consecutive rows, calculating the check value of the specified row data after the import specifically includes: calculating the sum of the check values of the specified groups of N consecutive rows in the imported data to be changed; and comparing whether the check values of the specified row data are consistent specifically includes: comparing the sums of the check values of the specified groups of N consecutive rows before and after the import.

Further, after splitting the data to be changed according to the rows, and before importing the split data to be changed to a corresponding database node, the method further includes:

acquiring, according to the distribution rule of the distributed database, the database node in which each piece of split data to be changed is to be stored.

Further, importing the split data to be changed to a corresponding database node, which specifically includes:

writing the split data to be changed into the file cache of the corresponding database node, notifying database cluster management of the number of completed files and the list of file names, and triggering, through database cluster management, a database agent to download the data to be changed stored in the file cache to the database node;

wherein the database agents correspond one-to-one to the database nodes.

Further, the data to be changed comprises data to be initialized, data to be migrated and data to be redistributed.

In another aspect, the present invention provides an apparatus for checking data in a distributed database, including:

the first calculation unit is used for exporting data to be changed into a data description text, and calculating a check value of specified row data in the data to be changed according to the exported data description text;

an importing unit, configured to split the data to be changed according to a row, and import the split data to be changed to a corresponding database node;

the second calculation unit is used for calculating the check value of the specified row data in the data to be changed after the data is imported;

and the comparison unit is used for comparing whether the check values of the specified row data in the data to be changed are consistent before and after the import, and if so, determining that the data to be changed is consistent before and after the change.

Further, the first calculation unit is further configured to calculate the check value of a single specified row of the data to be changed, or to calculate the sum of the check values of one or more specified groups of N consecutive rows of the data to be changed.

Further, the second calculation unit is further configured to calculate, when the specified row data is a single row, the check value of the specified single row in the imported data to be changed; and, when the specified row data is one or more groups of N consecutive rows, to calculate the sum of the check values of the specified groups of N consecutive rows in the imported data to be changed;

the comparison unit is further configured to compare, when the specified row data is a single row, the check values of the specified single row before and after the import; and, when the specified row data is one or more groups of N consecutive rows, to compare the sums of the check values of the specified groups of N consecutive rows before and after the import.

Further, the importing unit further includes:

the splitting module is used for splitting the data to be changed according to the rows;

the acquisition module is used for acquiring database nodes in which the split data to be changed are respectively stored according to a distributed distribution rule;

and the importing module is used for importing the split data to be changed to a corresponding database node.

Further, the importing unit further includes:

the importing module is used for writing the split data to be changed into the file cache of the corresponding database node, notifying database cluster management of the number of completed files and the list of file names, and triggering, through database cluster management, the database agent to download the data to be changed stored in the file cache to the database node, wherein the database agents correspond one-to-one to the database nodes.

In a further aspect, the invention provides a database cluster server provided with any one of the above devices for checking distributed database data.

The invention has the following beneficial effects:

The invention determines the consistency of the data to be changed before and after the change by comparing whether the check values of the specified row data in the data to be changed are consistent before and after the import, and effectively solves the problem that distributed databases in the prior art cannot determine the consistency of data before and after a change.

Drawings

FIG. 1 is a flow chart of a method for distributed database data verification according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for distributed database data verification according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an apparatus for checking data of a distributed database according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the architecture of a system for online data migration according to an embodiment of the present invention.

Detailed Description

In order to solve the problem that the data consistency before and after the data change cannot be determined in the distributed database in the prior art, the invention provides a method, a device and a related device for verifying the data of the distributed database. The present invention will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Method embodiment

The embodiment of the invention provides a method for checking distributed database data. The method is executed by a database cluster server. Referring to FIG. 1, the method comprises the following steps:

S101, exporting the data to be changed into a data description text, and calculating, from the exported data description text, a check value of specified row data in the data to be changed;

S102, splitting the data to be changed by row, and importing the split data to be changed into the corresponding database nodes;

S103, after the data import is completed, calculating the check value of the specified row data in the imported data to be changed;

S104, comparing whether the check values of the specified row data in the data to be changed are consistent before and after the import, and if so, determining that the data to be changed is consistent before and after the change.

That is, the invention determines the consistency of the data to be changed before and after the change by comparing whether the check values of the specified row data are consistent before and after the import, and effectively solves the problem that distributed databases in the prior art cannot determine the consistency of data before and after a change.
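Steps S101 to S104 can be illustrated with a minimal in-memory sketch. It rests on several assumptions that are not taken from the patent: the check value of a row is taken to be the sum of its character codes, the distribution rule is a simple modulo on an integer key in the first field, and plain Python lists stand in for database nodes. The names `check_value`, `import_rows` and `is_consistent` are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def check_value(row: str) -> int:
    # Assumption: a row's check value is the sum of its character codes.
    return sum(ord(ch) for ch in row)


def import_rows(rows: List[str], node_count: int) -> Dict[int, List[str]]:
    # S102: split by row and place each row on a node. The modulo rule on the
    # integer key in the first field is a hypothetical distribution rule.
    nodes: Dict[int, List[str]] = defaultdict(list)
    for row in rows:
        key = int(row.split(",", 1)[0])
        nodes[key % node_count].append(row)
    return nodes


def is_consistent(exported: List[str], nodes: Dict[int, List[str]]) -> bool:
    # S101: check value of the specified rows (here: all rows) before import.
    before = sum(check_value(r) for r in exported)
    # S103: check value of the same rows after import, gathered from all nodes.
    after = sum(check_value(r) for stored in nodes.values() for r in stored)
    # S104: the data is considered consistent if both values match.
    return before == after


exported = ["1,alice", "2,bob", "3,carol"]
nodes = import_rows(exported, node_count=2)
print(is_consistent(exported, nodes))
```

In this sketch a row that is altered or lost on a node changes the sum computed in S103, so the comparison in S104 fails.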

In a specific implementation, step S101 of the embodiment of the present invention specifically includes: calculating the check value of a single specified row of the data to be changed, or calculating the sum of the check values of one or more specified groups of N consecutive rows of the data to be changed.

That is, the invention can determine whether the data to be changed is identical before and after the import either by simple sampling, calculating the check value of a single specified row, or by sampling over a larger range, calculating the sum of the check values of one or more specified groups of N consecutive rows.

It should be noted that the scheme of calculating the sum of the check values of one or more groups of N consecutive rows also covers the case in which all rows of the data to be changed are checked.

Specifically, when the specified row data is a single row, calculating the check value of the specified row data after the import specifically includes: calculating the check value of the specified single row in the imported data to be changed; and comparing whether the check values of the specified row data are consistent specifically includes: comparing the check values of the specified single row before and after the import.

For example, if all even-numbered rows are specified for checking, the invention calculates the check values of all even-numbered rows of the data to be changed before and after the import, and compares them to determine whether the data to be changed is consistent before and after the import.

As another example, if all rows are specified for checking, the invention calculates the check values of all rows of the data to be changed before and after the import, and compares them to determine whether the data to be changed is consistent before and after the import.

In a specific implementation, the rows of the data to be changed in step S101 and the data check values corresponding to those rows are stored in a preset check table. Alternatively, the number of rows of the data to be changed and the sum of the data check values of those rows may be stored in the check table, so that the data consistency check can be performed after the data is imported.

That is, without affecting the operation of existing services, the invention imports the data into the corresponding nodes according to a preset data distribution rule, and ensures strong consistency between the data imported into the database and the original data by two means: checking all rows and checking a sample of rows. The number of sampled rows is configurable.

When the data to be changed is exported into the data description text, each row of data is located and assigned to a database node according to the data positioning. After the import is completed, the data check value of the imported data is obtained for the rows recorded in the preset check table, which may be sampled rows (for example, the even-numbered rows, that is, sampling verification) or preset rows (which may be set arbitrarily or may be all rows, that is, full verification). This value is compared with the data check value recorded in the check table; if the two are consistent, the data is considered consistent before and after the import.
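The check table with configurable sampling can be sketched as follows. This is only an illustration under stated assumptions: the `CheckTable` record, the `build_check_table` function and the two selector functions are hypothetical names, and the check value is again assumed to be the sum of a row's character codes. The table keeps the total row count and the accumulated check value of the rows selected for checking.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CheckTable:
    """Total row count and accumulated check value of the checked rows."""
    row_count: int = 0
    check_sum: int = 0


def even_rows(line_no: int) -> bool:
    # Sampling verification: only even-numbered rows are checked.
    return line_no % 2 == 0


def all_rows(line_no: int) -> bool:
    # Full verification: every row is checked.
    return True


def build_check_table(rows: Iterable[str],
                      selected: Callable[[int], bool]) -> CheckTable:
    table = CheckTable()
    for line_no, row in enumerate(rows, start=1):
        table.row_count += 1
        if selected(line_no):
            # Assumption: check value = sum of the row's character codes.
            table.check_sum += sum(ord(ch) for ch in row)
    return table


rows = ["1,alice", "2,bob", "3,carol", "4,dave"]
print(build_check_table(rows, even_rows))
print(build_check_table(rows, all_rows))
```

Because `CheckTable` is a dataclass, a table built before the import can be compared with one built from the imported rows using `==`.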

The data to be changed in the embodiment of the invention comprises data to be initialized, data to be migrated and data to be redistributed. That is, the invention can verify the consistency of data before and after changes such as data initialization, data migration and data redistribution. Because the whole verification process does not need to lock the current database, data can be located independently by row, and data distribution and data verification can be carried out independently. The I/O of a database server is occupied only while the data of a single node is being imported, so the impact on online services is small.

The embodiment of the invention specifically comprises the steps of:

Before data migration or data initialization, the data to be migrated needs to be exported into a data description text. Both cross-database migration and migration within the current distributed database are supported. For cross-database migration, the database table to be migrated is exported to a text description file using the syntax of the source database; for the current distributed database, the distributed data can be exported to a text file through the LoadServer.

The embodiment of the invention calculates the data check value of each row of the data to be changed that is to be verified, or calculates the number of rows to be verified and the sum of their data check values, and stores the result in a preset check table for the subsequent consistency check.

In a specific implementation, the text is read into memory according to the text description rule, and the ASCII value of the current row of data (that is, its data check value) is calculated and stored in memory. When consecutive rows need to be verified, the ASCII values of the individual rows are added together to obtain the sum of the ASCII values of those rows.
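A minimal sketch of this calculation, assuming that the "ASCII value" of a row means the sum of the character codes of the row's text. The description does not spell out the exact formula, so this interpretation and the function names `row_check_value` and `rows_check_sum` are assumptions.

```python
from typing import Iterable


def row_check_value(row: str) -> int:
    """Check value of one row of the data description text.

    Assumption: the row's "ASCII value" is the sum of its character codes.
    """
    return sum(ord(ch) for ch in row)


def rows_check_sum(rows: Iterable[str]) -> int:
    """Add up the per-row check values of several consecutive rows."""
    return sum(row_check_value(row) for row in rows)


rows = ["1,alice", "2,bob"]
print(row_check_value(rows[0]))
print(rows_check_sum(rows))
```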

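After splitting the data to be changed and before importing the split data to be changed into the corresponding database node, the embodiment of the invention further comprises: acquiring, according to the distribution rule of the distributed database, the database node in which each piece of split data to be changed is to be stored.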

The embodiment of the invention imports the split data to be changed to the corresponding database node, and specifically comprises the following steps:

In a specific implementation, the current row of data is written into the file cache of the corresponding database node. If the cache is full, or the number of rows that the configuration file requires a node file to hold has been reached, the data is written to a file and a new file to be written is created;

after a certain number of files have been generated, the database cluster server, according to the number of completed files and the list of file names, notifies the database agent to download the corresponding files to the database server and import them into the corresponding database;

after the data import is completed, the database cluster server initiates verification and, through database cluster management, sends a verification request to the database agent to obtain the data check value, counted by the database agent, of the data rows stored in the current database node, where the database agent is the one corresponding to the database node that stores the rows of the data to be changed; or, according to the rows of the data to be changed, it sends a verification request to the database agent through database cluster management and obtains the number of data rows of the current database node counted by the database agent together with the sum of the corresponding data check values. The database agents correspond one-to-one to the database nodes.

Specifically, after the data import is completed, the database cluster server initiates a data verification process and distributes a data verification request to the database agents (DBAgent) of all database nodes holding the table being verified, so that each DBAgent counts the number of rows of the current table and the ASCII value of the data of that table. After receiving the results fed back by all nodes, the server compares the row counts and the data check values. If the row counts are the same and the sampled check values are the same, the data consistency check passes and a successful data migration is reported.
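The comparison performed by the database cluster server once every node has reported can be sketched as follows. The `NodeFeedback` record and the `verify_import` function are hypothetical; the description does not define the format of the messages exchanged with the DBAgent, so the fields shown here are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeFeedback:
    """Result reported by one node's agent (hypothetical message shape)."""
    node_id: str
    row_count: int   # rows of the table counted on this node
    check_sum: int   # sum of the check values of the checked rows on this node


def verify_import(expected_rows: int, expected_sum: int,
                  feedback: Iterable[NodeFeedback]) -> bool:
    results = list(feedback)
    total_rows = sum(f.row_count for f in results)
    total_sum = sum(f.check_sum for f in results)
    # The check passes only if both the row count and the check value match.
    return total_rows == expected_rows and total_sum == expected_sum


feedback = [NodeFeedback("node-1", 2, 300), NodeFeedback("node-2", 1, 150)]
print(verify_import(expected_rows=3, expected_sum=450, feedback=feedback))
```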

FIG. 2 is a flowchart of another method for checking data in a distributed database according to an embodiment of the present invention. The method is explained in detail below with reference to FIG. 2:

S201, start;

S202, data export;

that is, the data to be changed is exported into a data description text, and the ASCII value of the current row of data, or the sum of the ASCII values of the rows, is calculated;

S203, data import and verification data generation;

specifically: the data to be changed is written into the file cache of the corresponding database node, database cluster management is notified of the number of completed files and the list of file names, and the database agent is triggered through database cluster management to download the data to be changed stored in the file cache to the database node; after the data is imported, a verification request is sent to the database agent through database cluster management to obtain the data check value, counted by the database agent, of the data rows stored in the current database node;

S204, data verification;

that is, the data check values (or the sums of the data check values) of the rows to be verified are compared before and after the import, and if they are consistent, the data to be changed is determined to be consistent before and after the change;

S205, ending.

The method according to the invention will be explained and illustrated in further detail below by means of a specific example, the method comprising:

Stage one, data file generation:

before data migration or data initialization, the data to be migrated needs to be exported into a data description text. Both cross-database migration and migration within the current distributed database are supported. For cross-database migration, the database table to be migrated is exported to a text description file using the syntax of the source database; for the current distributed database, the distributed data can be exported to a text file through the database cluster server.

Stage two, data migration:

reading the text into memory according to the text description rule, calculating the ASCII value of the current row of data and storing it in memory;

acquiring, according to the distribution rule, the database node in which the current row of data is to be stored;

writing the current row of data into the file cache of the corresponding database node, and, if the cache is full or the number of rows that the configuration file requires a node file to hold has been reached, writing the data to a file and creating a new file to be written;

after a certain number of files have been generated, the database cluster server notifies the DBAgent to download the corresponding files to the database server and import them into the corresponding database;

repeating the above steps until all the data has been imported into the distributed database;
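The per-node file cache used in stage two can be sketched as below. The sketch makes several assumptions: the class name `NodeFileWriter`, the file naming scheme and the two thresholds (`rows_per_file`, `notify_every`) are hypothetical, an in-memory dictionary stands in for files on disk, and a plain callback stands in for the notification sent to the DBAgent.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class NodeFileWriter:
    """Buffers rows per node and rolls over to a new file when a buffer is full."""

    def __init__(self, rows_per_file: int, notify_every: int,
                 notify: Callable[[List[str]], None]) -> None:
        self.rows_per_file = rows_per_file
        self.notify_every = notify_every
        self.notify = notify                     # stands in for notifying the agent
        self.cache: Dict[str, List[str]] = defaultdict(list)
        self.files: Dict[str, List[str]] = {}    # file name -> rows (stands in for disk)
        self.pending: List[str] = []             # completed files not yet announced
        self.seq = 0

    def write_row(self, node: str, row: str) -> None:
        self.cache[node].append(row)
        if len(self.cache[node]) >= self.rows_per_file:
            self._flush(node)

    def _flush(self, node: str) -> None:
        rows = self.cache.pop(node, [])
        if not rows:
            return
        self.seq += 1
        name = f"{node}_{self.seq:04d}.txt"
        self.files[name] = rows                  # "write the data to a file"
        self.pending.append(name)
        if len(self.pending) >= self.notify_every:
            self._announce()

    def _announce(self) -> None:
        if self.pending:
            self.notify(list(self.pending))      # completed file name list
            self.pending.clear()

    def close(self) -> None:
        # Flush partially filled buffers and announce the remaining files.
        for node in list(self.cache):
            self._flush(node)
        self._announce()
```

A caller would route each row to `write_row(node, row)` after applying the distribution rule, and call `close()` once the whole text has been read.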

Stage three, data consistency check:

after receiving all the data import completion requests, the database cluster server initiates a data verification process.

The database cluster server distributes the data verification request to the DBAgents of all database nodes holding the current table, so that each DBAgent counts the number of rows of the current table and the ASCII value of the data of that table.

After receiving the results fed back by the nodes, the server compares the row counts and the data check values. If the row counts are the same and the sampled check values are the same, the data consistency check passes and a successful data migration is reported.

The invention is described in detail below using the example of migrating a specific DB2 database to a MariaDB distributed cluster database:

Data export: export the data to an external file using the export method provided by DB2;

Check table generation: generate a check table (supporting both full verification and sampling verification) according to the configured number of rows to check and the number of rows in the file;

File splitting: read the data file row by row, calculate the node to which the current row belongs according to the distribution rule, and judge whether the current row needs to be checked; if so, calculate the ASCII value of the current row and accumulate it into the check result, generate the SQL statement that locates the current row in the database, and write that statement into the check SQL file of the current group; repeat in sequence until the file has been read completely, and count the number of rows in the current file;

Data import: import the split data files into the corresponding node databases through the database agent DBAgent;

Data verification: after all the data has been imported, the database cluster server initiates a data verification process and compares whether the row count and check value sum of the current file are consistent with the total row count and total check value sum of the data in the imported database. If they are consistent, the data is consistent before and after the migration and the data migration is complete; if they are inconsistent, the migration needs to be carried out again.
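The file splitting step of this example can be sketched as a single pass over the exported rows. The sketch is illustrative only: the table name `t_demo`, the key column `id`, the comma-separated row layout with an integer key in the first field, and the modulo distribution rule are all assumptions, and the generated `SELECT` statement is just one possible form of the statement that locates a row.

```python
from typing import Callable, Dict, List, Tuple


def split_file(rows: List[str], node_count: int,
               needs_check: Callable[[int], bool],
               table: str = "t_demo", key_column: str = "id"
               ) -> Tuple[Dict[int, List[str]], int, List[str], int]:
    per_node: Dict[int, List[str]] = {n: [] for n in range(node_count)}
    check_sum = 0
    check_sql: List[str] = []
    for line_no, row in enumerate(rows, start=1):
        key = int(row.split(",", 1)[0])
        # Hypothetical distribution rule: integer key modulo the node count.
        per_node[key % node_count].append(row)
        if needs_check(line_no):
            # Accumulate the row's check value (sum of character codes, assumed).
            check_sum += sum(ord(ch) for ch in row)
            # Statement that locates this row after the import.
            check_sql.append(
                f"SELECT * FROM {table} WHERE {key_column} = {key};")
    # The last element is the number of rows read from the file.
    return per_node, check_sum, check_sql, len(rows)


per_node, check_sum, check_sql, row_count = split_file(
    ["1,alice", "2,bob", "3,carol"], node_count=2,
    needs_check=lambda n: n % 2 == 0)
print(per_node, check_sum, check_sql, row_count)
```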

The invention is described in detail below using a specific example of data backup and restore based on a MariaDB distributed cluster:

Acquiring the full data: export the distributed database data into a text file using the import/export tool of the distributed database;

Check row list generation: generate a check table (supporting both full verification and sampling verification) according to the configured number of rows to check and the number of rows in the file;

Splitting the original file: read the data file row by row, calculate the node to which the current row belongs according to the distribution rule, and judge whether the current row needs to be checked; if so, calculate the ASCII value of the current row and accumulate it into the check result, generate the SQL statement that locates the current row in the database, and write that statement into the check SQL file of the current group; repeat in sequence until the file has been read completely, and count the number of rows in the current file;

Data recovery: import the split data files into the corresponding node databases through the database agent DBAgent;

Data verification: after all the data has been imported, the database cluster server initiates a data verification process and compares whether the row count and check value sum of the current file are consistent with the total row count and total check value sum of the data in the new nodes. If they are consistent, the data is consistent before and after backup and recovery, and the full data recovery is complete. If they are inconsistent, the data recovery process needs to be carried out again.

Compared with the existing technology of the distributed database in the industry, the invention has the following beneficial effects:

1. The invention performs well. The basic data needed for verification is prepared during the data migration itself, so no separate preparation pass is needed, which greatly shortens the duration of the data migration;

2. The method does not interfere with the operation of online services. It does not need to add a virtual column to the table being checked or to lock the table, so the impact on online services is very small;

3. The verification mode is flexible. The method supports both sampling verification and full verification, and the completion time of the current data migration task can be shortened by assigning suitable verification levels to the different tables being verified;

4. The method supports data verification for cross-database data migration. The entry point of the migration is a data description text file; every database supports exporting its data to a text description file, and a distributed database can export a distributed-database text file through the database cluster server.

Device embodiment

The embodiment of the invention provides a device for checking data of a distributed database. Referring to FIG. 3, the device comprises: a first calculation unit, used for exporting data to be changed into a data description text, and calculating a check value of specified row data in the data to be changed according to the exported data description text; an importing unit, configured to split the data to be changed by row, and import the split data to be changed into the corresponding database nodes; a second calculation unit, used for calculating the check value of the specified row data in the data to be changed after the data is imported; and a comparison unit, used for comparing whether the check values of the specified row data in the data to be changed are consistent before and after the import, and if so, determining that the data to be changed is consistent before and after the change.

Further, the first calculation unit of the embodiment of the present invention is further configured to calculate the check value of a single specified row of the data to be changed, or to calculate the sum of the check values of one or more specified groups of N consecutive rows of the data to be changed.

Further, the second calculation unit in the embodiment of the present invention is further configured to calculate, when the specified row data is a single row, the check value of the specified single row in the imported data to be changed; and, when the specified row data is one or more groups of N consecutive rows, to calculate the sum of the check values of the specified groups of N consecutive rows in the imported data to be changed.

It should be noted that, in the embodiment of the present invention, the data to be changed includes data to be initialized, data to be migrated, and data to be redistributed. That is, the invention can verify the consistency of data before and after changes such as data initialization, data migration and data redistribution. Because the whole verification process does not need to lock the current database, data can be located independently by row, and data distribution and data verification can be carried out independently. The I/O of a database server is occupied only while the data of a single node is being imported, so the impact on online services is small.

Further, the importing unit further includes: the splitting module splits the data to be changed according to the row; the acquisition module acquires database nodes in which the split data to be changed are respectively stored according to a distributed distribution rule; and the importing module imports the split data to be changed to a corresponding database node.

Further, the importing unit further includes: the splitting module splits the data to be changed according to the row; the importing module writes the split data to be changed into a file cache of a corresponding database node, informs a database cluster of managing the number of completed files and a file name list, triggers a database agent to download the data to be changed stored in the file cache to the database node through the database cluster management, and the database agent corresponds to the database node one by one respectively.

After a certain number of files have been generated, the database cluster server notifies the database agent, according to the number of completed files and the file name list, to download the corresponding files to the database server and import them into the corresponding database.
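The file-cache import path can be sketched in a single process as follows. This is an illustration under stated assumptions, not the patented implementation: the classes `DatabaseAgent` and `ClusterManagement`, the function `import_via_file_cache`, the file naming scheme, and the `rows_per_file` threshold are all hypothetical, and the notification is sent once per node after all of its files are written rather than after "a certain number" of files.

```python
import os
import tempfile


class DatabaseAgent:
    """Hypothetical agent, one per database node; loads cached files into its node."""

    def __init__(self) -> None:
        self.node_rows: list[str] = []  # stands in for the node's table

    def download(self, paths: list[str]) -> None:
        for path in paths:
            with open(path, encoding="ascii") as f:
                self.node_rows.extend(line.rstrip("\n") for line in f)


class ClusterManagement:
    """Hypothetical cluster management; relays notifications to the agents."""

    def __init__(self, agents: dict[int, DatabaseAgent]) -> None:
        self.agents = agents  # node id -> agent (one-to-one)

    def notify(self, node_id: int, file_count: int, file_names: list[str]) -> None:
        if file_count != len(file_names):
            raise ValueError("file count does not match file name list")
        self.agents[node_id].download(file_names)


def import_via_file_cache(batches, cache_dir, cluster, rows_per_file=2):
    """Write each node's rows to cache files, then notify cluster management."""
    for node_id, rows in batches.items():
        names = []
        for i in range(0, len(rows), rows_per_file):
            path = os.path.join(cache_dir, f"node{node_id}_{i // rows_per_file}.txt")
            with open(path, "w", encoding="ascii") as f:
                f.write("\n".join(rows[i:i + rows_per_file]) + "\n")
            names.append(path)
        cluster.notify(node_id, len(names), names)


agents = {0: DatabaseAgent(), 1: DatabaseAgent()}
cluster = ClusterManagement(agents)
with tempfile.TemporaryDirectory() as cache_dir:
    import_via_file_cache(
        {0: ["2,bob", "4,dave", "6,frank"], 1: ["1,alice"]}, cache_dir, cluster
    )
print(agents[0].node_rows, agents[1].node_rows)
```

In a real deployment the cache, cluster management, and agents would be separate processes on separate hosts; the sketch only shows the order of the steps and the content of the notification (file count plus file name list).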

In a specific implementation, the second calculation unit in the embodiment of the invention sends a verification request to the database agent through database cluster management to obtain the data check value, as counted by the database agent, of the data row stored in the current database node, where the database agent is the one corresponding to the database node that stores the data row to be changed. Alternatively, according to the rows of the data to be changed, the second calculation unit sends a verification request to the database agents through database cluster management and obtains the number of data rows of the current database node counted by each database agent together with the sum of the corresponding data check values, where the database agents are those corresponding to the database nodes that store the data rows to be changed. The database agents are in one-to-one correspondence with the database nodes.

Fig. 4 is a schematic diagram of an online data migration system according to an embodiment of the present invention. As shown in Fig. 4, after the data is imported, the comparison unit initiates a data verification process and distributes a data verification request to the database agents (dbagent) of all database nodes of the table being verified, so that each dbagent counts the number of rows of the current table and computes the ASCII value of the data of the table being verified. After receiving the results fed back by each node, the comparison unit compares the number of data rows and the data check value; if the number of rows is the same and the sampled check value is the same, the data consistency verification passes and the data migration is successful.
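The comparison step can be sketched as follows, assuming that each dbagent reports a pair (row count, sum of ASCII codes of its rows) and that the comparison unit adds up the per-node reports before comparing them with the values computed from the exported text. `agent_report` and `verify` are hypothetical names, and the additive check value is an assumed reading of "ASCII value".

```python
def agent_report(rows: list[str]) -> tuple[int, int]:
    """What one dbagent is assumed to return: (row count, sum of ASCII codes)."""
    return len(rows), sum(sum(row.encode("ascii")) for row in rows)


def verify(exported_rows: list[str], node_tables: list[list[str]]) -> bool:
    """Compare totals over all nodes with the values from the exported text."""
    expected = agent_report(exported_rows)
    reports = [agent_report(table) for table in node_tables]
    total = (sum(count for count, _ in reports), sum(value for _, value in reports))
    return total == expected


exported_rows = ["1,alice", "2,bob", "3,carol"]
ok = verify(exported_rows, [["2,bob"], ["1,alice", "3,carol"]])
bad = verify(exported_rows, [["2,bob"], ["1,alice", "3,carxl"]])
print(ok, bad)
```

Because the totals are plain sums, the result does not depend on how rows are distributed across nodes. The same property means that a change which only reorders characters within the data would leave this particular check value unchanged.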

The relevant content of the device of the present invention can be understood by referring to the embodiment part of the method, and will not be described in detail herein.

Server embodiment

The embodiment of the invention provides a database cluster server, which comprises any one of the distributed database data verification devices in the device embodiment.

The relevant content in the embodiments of the present invention may be understood by referring to the device embodiment and the method embodiment, and will not be described herein.

The invention can at least achieve the following beneficial effects:

By comparing whether the data check values of the data rows to be verified in the data to be changed are consistent before and after import, or whether the sums of the data check values of the corresponding data rows are consistent before and after import, the invention can accurately determine the consistency of the data to be changed before and after the change, thereby effectively solving the problem that the distributed database in the prior art cannot determine data consistency before and after a change.

Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, and accordingly the scope of the invention is not limited to the embodiments described above.

Claims

1. A method for data verification of a distributed database, comprising:

exporting data to be changed into a data description text, and calculating a check value of specified row data in the data to be changed according to the exported data description text; wherein the check value is an ASCII value;

after the data is imported, calculating the check value of the specified row data in the data to be changed after import, comparing whether the check values of the specified row data in the data to be changed are consistent before and after import, and if so, determining that the data to be changed is consistent before and after the change;

the calculating the check value of the specified row data in the data to be changed specifically includes:

calculating the check value of a specified row of data in the data to be changed, or calculating the sum of the check values of one or more specified groups of N consecutive rows of data in the data to be changed;

importing the split data to be changed to a corresponding database node, wherein the method specifically comprises the following steps of:

2. The method of claim 1, wherein,

when the specified row data is a single row, calculating the check value of the specified row data in the data to be changed after import specifically includes: calculating the check value of the specified row of data in the data to be changed after import; and comparing whether the check values of the specified row data in the data to be changed are consistent specifically includes: comparing the check values of the specified row of data in the data to be changed before and after import;

3. The method according to any one of claims 1-2, wherein after splitting the data to be changed according to rows and before importing the split data to be changed to a corresponding database node, further comprising:

4. The method according to any one of claims 1-2, wherein,

the data to be changed comprises data to be initialized, data to be migrated and data to be re-distributed.

5. An apparatus for verifying data in a distributed database, comprising:

the first calculation unit is used for exporting data to be changed into a data description text, and calculating a check value of specified row data in the data to be changed according to the exported data description text; wherein the check value is an ASCII value;

the comparison unit is used for comparing whether the check values of the specified row data in the data to be changed are consistent before and after the data to be changed are imported, and if so, determining that the data to be changed are consistent before and after the data to be changed are changed;

the first calculating unit is further configured to calculate a check value of a certain row of data specified in the data to be changed, or calculate a sum of check values of one or more continuous N rows of data specified in the data to be changed;

the importing unit further includes:

6. The apparatus of claim 5, wherein,

the second calculating unit is further configured to: when the specified row data is a single row, calculate the check value of the specified row of data in the data to be changed after import; and when the specified row data is one or more groups of N consecutive rows, calculate the sum of the check values of the specified rows of data in the data to be changed after import;

7. The apparatus according to any one of claims 5-6, wherein the importing unit further comprises:

8. The apparatus according to any one of claims 5 to 6, wherein,

9. A database cluster server comprising the apparatus for distributed database data verification as claimed in any one of claims 5 to 8.