CN110647531A - Data synchronization method, device, equipment and computer readable storage medium

Info

Publication number: CN110647531A
Application number: CN201910751919.7A
Authority: CN
Inventors: 许浩奇; 刘广平
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2020-01-03

Abstract

The invention relates to the technical field of big data processing and discloses a data synchronization method. Incremental task data in a task data table to be synchronized is determined according to the task data table and a second check code of the task data in the task data table, together with historical data and a first check code of the task data in the historical data, so that only the incremental task data needs to be synchronized during synchronization. The invention also provides a data synchronization device, equipment and a computer readable storage medium. This data synchronization mode reduces the amount of data in the synchronization process, shortens the data synchronization time, reduces the resource occupancy of the equipment during synchronization and improves the resource utilization of the equipment.

Description

Data synchronization method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of big data processing technologies, and in particular, to a data synchronization method, apparatus, device, and computer-readable storage medium.

Background

Hive is a basic data warehouse tool built on Hadoop for processing structured data. To some extent it can be regarded as a user programming interface (API): it does not store or process data itself, but relies on HDFS to store data and on MapReduce (MR) to process data. It provides a series of tools for extracting, transforming and loading data, and the Hive query language is used to process structured data by way of MapReduce. Hive is not suited to real-time queries.

In actual development, Hive is often used for data cleaning, storage and analysis. For example, data is processed on Hive and then exported to a real-time query engine, such as HBase or Elasticsearch, so that upper-layer applications can perform real-time queries. A Hive task runs with a certain validity period, and data update synchronization is performed once that period expires. Most existing data synchronization updates adopt a full synchronization mode; that is, at each update the historical data and the new data must be exported and updated together. As the data volume keeps growing, the resources and memory consumed during updating keep growing as well, and the tasks occupy cluster computing resources for a long time, which wastes resources. Meanwhile, because the full mode generally has a high delay, continuously accumulating updates can cause the whole export to run past the validity period; since the Hive task refreshes the data as soon as the validity period is reached, data that has not yet been completely exported is overwritten by the refresh, which leads to data loss.

Disclosure of Invention

The invention mainly aims to provide a data synchronization method, a data synchronization device, data synchronization equipment and a computer readable storage medium, and aims to solve the technical problem that, in the existing data synchronization mode, each synchronization transfers a large amount of repeated data and therefore takes too long.

In order to achieve the above object, the present invention provides a data synchronization method applied to a data storage device, the data synchronization method comprising the steps of:

detecting whether the data validity of the data storage device meets an updating condition, wherein the updating condition is that the data validity reaches an expiration time;

if the data validity meets the updating condition, acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data, wherein the task data table at least comprises the new data and the historical data which are changed currently, the historical data is the task data stored in the data storage device currently, the first check code comprises N first sub-check codes, the first sub-check codes are calculated based on at least one row of data or at least one task data in the historical data, and N is greater than or equal to 1;

calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;

comparing whether the second check code is consistent with the first check code or not to judge whether the task data currently stored in the data storage equipment is abnormal or not;

if the abnormal task data exist, screening the abnormal task data from the task data table, keeping the historical data, and synchronizing the abnormal task data into the data storage equipment, wherein the abnormal task data are incremental task data;

and if the abnormal condition does not exist, replacing the first check code with the second check code, and deleting the historical data and the first check code in the task data table.

In other embodiments, the calculation algorithm includes a Hamming code check algorithm and a hash algorithm.

In other embodiments, the step of calculating the second check code of the task data in the task data table by using the same calculation algorithm as that used for calculating the first check code includes:

if the calculation algorithm is a Hash algorithm, calculating an integral check value of the task data in the task data table according to the Hash algorithm;

extracting each row of data or each task data in the task data table, and calculating a verification code of each row of data or each task data according to the Hash algorithm to obtain a second sub-verification code, wherein the second sub-verification code is an MD code;

and independently storing all the MD codes into a preset verification code table in the task data table.
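As a minimal illustration of the hash-based steps above, the following Python sketch computes an overall check value and one MD code per row. The use of MD5, the representation of a row as a tuple of strings, the field separator, and the names `row_md_code`, `overall_check_value` and `build_check_code` are assumptions made for illustration; the text does not prescribe them.

```python
import hashlib

FIELD_SEP = "\x1f"  # assumed separator; keeps ("ab", "c") distinct from ("a", "bc")


def row_md_code(row):
    """Return the MD5 hex digest of one row (a tuple of strings)."""
    return hashlib.md5(FIELD_SEP.join(row).encode("utf-8")).hexdigest()


def overall_check_value(rows):
    """Return a single MD5 digest covering every row in table order."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(row_md_code(row).encode("ascii"))
    return digest.hexdigest()


def build_check_code(rows):
    """Return the check code: the overall value plus the per-row MD codes."""
    return {
        "overall": overall_check_value(rows),
        "md_codes": [row_md_code(row) for row in rows],
    }


# Hypothetical task data table with two rows.
task_table = [("task-1", "open"), ("task-2", "closed")]
check_code = build_check_code(task_table)
```

Keeping the MD codes in their own list mirrors the idea of storing them separately in a verification code table, so that they can later be read without rescanning the task data.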

In some other embodiments, the step of comparing whether the second check code is consistent with the first check code to determine whether the task data currently stored in the data storage device has an exception includes:

extracting a first check value and a first sub check code from the first check code, wherein the first check value is an integral check value of the historical data;

comparing whether the overall check value is the same as the first check value;

if the overall check value is the same as the first check value, determining that the incremental task data does not exist in the task data table;

and if the integral check value is different from the first check value, determining that the incremental task data exists in the task data table.

In other embodiments, after the step of determining that the incremental task data exists in the task data table, the method further includes:

and comparing each MD code in the second check code with the corresponding first sub-check code in the first check code one by one, and screening out different MD codes in the second check code based on the comparison result.
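The one-by-one comparison can be sketched as follows. The sketch assumes that each sub-check code is stored under a row key (for example a task identifier) so that the "corresponding" first sub-check code can be looked up; the key scheme and the name `differing_md_codes` are hypothetical.

```python
def differing_md_codes(first_sub_codes, second_sub_codes):
    """Return the keys whose MD code differs from, or is absent in, the first check code.

    Both arguments map a row key to a sub-check code (hex digest string).
    """
    changed = []
    for key, md_code in second_sub_codes.items():
        # A missing key means the row is new; an unequal code means it changed.
        if first_sub_codes.get(key) != md_code:
            changed.append(key)
    return changed


# Hypothetical digests, shortened for readability.
first = {"task-1": "a1", "task-2": "b2"}
second = {"task-1": "a1", "task-2": "ff", "task-3": "c3"}
changed_keys = differing_md_codes(first, second)
```

In this example `task-2` is reported because its code changed and `task-3` because it has no counterpart in the first check code.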

In other embodiments, the step of screening out abnormal task data from the task data table, retaining the historical data, and synchronizing the abnormal task data to the data storage device includes:

extracting corresponding increment task data from the task data table according to the screened MD codes different from the first sub-check codes, inserting the increment task data into a preset increment data table, setting the attribute of the inserted increment task data as a distinguishing field and recording the insertion time of the distinguishing field;

and retaining the task data currently stored in the data storage device, determining, in the incremental data table, the incremental task data whose insertion time is closest to the current time, and synchronizing the incremental task data to the data storage device.
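One possible reading of these two steps is sketched below: every row placed in the incremental data table is stamped with its insertion time, and the batch whose insertion time is closest to the current time is the one selected for synchronization. The list-of-dictionaries table layout and the names `insert_increment` and `latest_batch` are assumptions, not taken from the text.

```python
from datetime import datetime, timedelta


def insert_increment(increment_table, rows, inserted_at):
    """Append rows to the incremental table, stamping each with its insertion time."""
    for row in rows:
        increment_table.append({"row": row, "inserted_at": inserted_at})


def latest_batch(increment_table, now):
    """Return the rows whose insertion time is closest to `now`."""
    if not increment_table:
        return []
    nearest = min(increment_table, key=lambda entry: abs(now - entry["inserted_at"]))
    return [entry["row"] for entry in increment_table
            if entry["inserted_at"] == nearest["inserted_at"]]


# Hypothetical usage with two batches inserted one hour apart.
earlier = datetime(2019, 8, 15, 9, 0)
later = datetime(2019, 8, 15, 10, 0)
increment_table = []
insert_increment(increment_table, ["old-row"], earlier)
insert_increment(increment_table, ["new-a", "new-b"], later)
to_sync = latest_batch(increment_table, later + timedelta(minutes=5))
```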

In other embodiments, after the steps of screening out abnormal task data from the task data table, retaining the historical data, and synchronizing the abnormal task data to the data storage device, the method further comprises:

calculating the time deviation between the insertion time of each other field in the incremental data table and the insertion time of the distinguishing field;

judging whether the time deviation amount is larger than a preset time difference or not;

and if the time deviation is greater than the preset time difference, deleting from the incremental data table the incremental task data corresponding to all the fields whose time deviation is greater than the preset time difference.
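The clean-up of the incremental data table can be sketched as a filter on the time deviation. The entry layout (a dictionary with an `inserted_at` timestamp), the name `purge_stale` and the one-hour threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta


def purge_stale(increment_table, reference_time, max_deviation):
    """Keep only entries whose insertion time deviates from `reference_time`
    by no more than `max_deviation`; all other entries are dropped."""
    return [
        entry for entry in increment_table
        if abs(reference_time - entry["inserted_at"]) <= max_deviation
    ]


# Hypothetical table: one recent entry and one that is three hours old.
reference = datetime(2019, 8, 15, 10, 0)
table = [
    {"row": "fresh", "inserted_at": reference - timedelta(minutes=10)},
    {"row": "stale", "inserted_at": reference - timedelta(hours=3)},
]
kept = purge_stale(table, reference, timedelta(hours=1))
```

Here `reference` stands in for the insertion time of the distinguishing field, and the preset time difference is passed as `max_deviation`.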

In addition, to achieve the above object, the present invention also provides a data synchronization apparatus, including:

the detection module is used for detecting whether the data validity of the data storage equipment meets an updating condition, wherein the updating condition is that the data validity reaches the expiration time;

the data acquisition module is used for acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data when the data validity meets an updating condition, wherein the task data table at least comprises new data and historical data which change currently, the historical data is the task data stored in the data storage device currently, the first check code comprises N first sub-check codes, the first sub-check codes are obtained by calculation based on at least one row of data or at least one task data in the historical data, and N is greater than or equal to 1;

the calculation module is used for calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;

the comparison module is used for comparing whether the second check code is consistent with the first check code or not so as to judge whether the task data currently stored in the data storage equipment is abnormal or not;

the synchronization module is used for screening abnormal task data from the task data table when abnormality exists, reserving the historical data and synchronizing the abnormal task data into the data storage device, wherein the abnormal task data are incremental task data; and replacing the first check code with the second check code and deleting the historical data and the first check code in the task data table when no abnormity exists.

Optionally, the calculation algorithm includes a Hamming code check algorithm and a hash algorithm.

Optionally, the computing module includes an arithmetic unit and a storage unit, wherein,

when the calculation algorithm is a hash algorithm, the operation unit is used for calculating an overall check value of the task data in the task data table according to the hash algorithm; extracting each row of data or each task data in the task data table, and calculating a verification code of each row of data or each task data according to the Hash algorithm to obtain a second sub-verification code, wherein the second sub-verification code is an MD code;

the storage unit is used for storing all the MD codes into a preset verification code table in the task data table.

Optionally, the comparing module includes an extracting unit and a comparing unit, wherein,

the extraction unit is used for extracting a first check value and a first sub-check code from the first check code, wherein the first check value is an integral check value of the historical data;

the comparison unit is used for comparing whether the overall check value is the same as the first check value or not;

Optionally, the comparing module further includes a screening unit, configured to compare each MD code in the second check code with the corresponding first sub-check code in the first check code one by one, and screen out different MD codes in the second check code based on a comparison result.

Optionally, the synchronization module is configured to extract corresponding incremental task data from the task data table according to the screened MD codes that differ from the first sub-check codes, insert the incremental task data into a preset incremental data table, set an attribute of the inserted incremental task data as a distinguishing field, and record an insertion time of the distinguishing field; and to retain the task data currently stored in the data storage device, determine, from the incremental data table, the incremental task data whose insertion time is closest to the current time, and synchronize the incremental task data into the data storage device.

Optionally, the calculation module is further configured to calculate a time deviation between the insertion time of each other field in the incremental data table and the insertion time of the distinguishing field;

the comparison module is also used for judging whether the time deviation value is greater than a preset time difference or not;

and the synchronization module is further used for deleting the incremental task data corresponding to all the fields with the time difference larger than the preset time difference from the incremental data table when the time deviation is larger than the preset time difference.

In addition, to achieve the above object, the present invention also provides data synchronization equipment, including: a memory, a processor and a data synchronization program stored on the memory and executable on the processor, the data synchronization program, when executed by the processor, implementing the steps of the data synchronization method according to any one of the above.

Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon a data synchronization program which, when executed by a processor, implements the steps of the data synchronization method according to any one of the above.

The incremental task data in the task data table to be synchronized is determined according to the task data table and the second check code of the task data in the task data table, together with the historical data and the first check code of the task data in the historical data, so that only the incremental task data needs to be synchronized during synchronization.

Drawings

FIG. 1 is a schematic structural diagram of an operating environment of a data synchronization system according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of another data synchronization system according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a first embodiment of a data synchronization method according to the present invention;

FIG. 4 is a flowchart illustrating a data synchronization method according to a second embodiment of the present invention;

FIG. 5 is a functional block diagram of a data synchronization apparatus according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a data synchronization device.

Referring to fig. 1, fig. 1 is a schematic structural diagram of an operating environment of a data synchronization device according to an embodiment of the present invention.

As shown in fig. 1, the data synchronization apparatus includes: a processor 101 (e.g., a CPU), a communication bus 102, a user interface 103, a network interface 104 and a memory 105. The communication bus 102 enables connection and communication between these components. The user interface 103 may comprise a display screen (Display) and an input unit such as a keyboard (Keyboard), and the network interface 104 may optionally comprise a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 105 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 105 may alternatively be a storage device separate from the processor 101 described above.

Those skilled in the art will appreciate that the hardware configuration of the data synchronization system shown in fig. 1 does not constitute a limitation of the data synchronization system, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, the memory 105, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a data synchronization program for implementing data synchronization. The operating system is a program for managing and controlling the data synchronization system and software resource calls in the system and the memory, and supports the operation of the data synchronization program and other software and/or programs.

In the hardware configuration of the data synchronization system shown in fig. 1, the network interface 104 is mainly used for accessing a network; the user interface 103 is mainly used to detect whether a data synchronization operation is required in the system or monitor whether data in the system has information such as update, exception, etc., and the processor 101 may be used to call a data synchronization program stored in the memory 105 and perform the following operations of various embodiments of the data synchronization method.

In the embodiment of the present invention, the implementation shown in fig. 1 may also be a networking system composed of a mobile terminal and a server, where the mobile terminal is used as a device for generating a task data table, and the server is used for synchronizing the task data table on the mobile terminal, and a processor of the server recognizes incremental task data by reading a program code, which is stored in a buffer or a storage unit and can implement a data synchronization method, and implements the synchronization operation.

Further, the data synchronization method provided by the present invention can also be used to synchronize the data of synchronization databases between multiple servers S and terminals TE, or between several terminals TE and several servers S, where a synchronization database can be broadly understood as a storage device that can store any data. When synchronization is performed between a terminal TE and a server S, the terminal TE acts as the generating device for the task data table to be synchronized and the server S acts as the synchronization server. The server S is typically a web server or a PC. The terminal TE is typically a mobile phone, a PC (personal computer), a laptop or a PDA device.

As shown in fig. 2(a), the first example shows a synchronization system in which a client device TE and a synchronization server S are connected to a local area network LAN. The client device TE connected to the network LAN comprises functional components for communicating with the devices of the network LAN, such as a network card and software controlling data transmission. The local area network LAN may be any kind of local area network, and the TE may also be connected to the server S via the internet, typically through a firewall FW. The terminal TE may also be connected wirelessly to the local area network LAN via the access point AP.

The second example, shown in fig. 2(b), is a synchronization system in which a client device TE communicates with a server S via a mobile network MNW. The terminal TE connected to the network MNW comprises functionality for communicating wirelessly with the network MNW. There may also be other networks, such as a local area network LAN, between the mobile network MNW and the server S. The mobile network MNW may be any known wireless network, for example a network supporting GSM services, a network supporting GPRS (general packet radio service), a third generation mobile network such as a network defined according to the 3GPP (third generation partnership project), a wireless local area network WLAN or a private network. It should be noted that the server S itself may comprise the database it synchronizes, or the database may be located in another device; in fig. 2(b), the server S and the database DB are separated for convenience of explanation. Synchronization configurations other than those illustrated in fig. 2 are also possible.

Based on the hardware structure of the data synchronization system, the embodiments of the data synchronization method of the present invention are provided.

Referring to fig. 3, fig. 3 is a flowchart of a data synchronization method according to an embodiment of the present invention. In this embodiment, the data synchronization method specifically includes the following steps:

step S310, detecting whether the data validity of the data storage device meets an updating condition, wherein the updating condition is that the data validity reaches an expiration time;

In this step, data validity can be detected in two ways. In the first, the server that synchronizes the data checks whether the data on the mobile terminal has been updated, or whether the time for updating the data has been reached, in order to judge whether the data stored on the server is out of date. In the second, the server performs a self-check: it judges whether the data is out of date by detecting whether the difference between the start time of the data currently stored on the server and the current time is greater than a preset value, or whether the timer for data synchronization on the server has expired.

In practical applications, the detection may also determine whether the data in the data storage device has reached the time for a data update; specifically, whether the update condition is met may be determined by detecting a data update request on the data storage device. For automatic updating, timed updating is implemented by setting a timer: once the data storage device detects that the timer has expired, the device considers that the stored data has lost its validity and that new data needs to be obtained from a data source device connected to the data storage device for updating.

For actively triggered updating, the data storage device determines whether the data needs to be updated by detecting a data update request; if the request is detected, the original data stored in the device needs to be updated with the new data.
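The two triggers described for step S310 (expiry of a validity period and an explicit update request) can be combined in one predicate. This is only a sketch: the function name `needs_update`, its parameters and the use of seconds since the epoch are assumptions introduced here.

```python
import time


def needs_update(last_sync_time, validity_seconds, update_requested=False, now=None):
    """Return True when the stored data has expired or an update was requested.

    `last_sync_time` and `now` are timestamps in seconds; `now` defaults to
    the current time so that the check can be run from a periodic timer.
    """
    if now is None:
        now = time.time()
    expired = (now - last_sync_time) >= validity_seconds
    return expired or update_requested
```

For example, `needs_update(0, 60, now=61)` reports an expired validity period, while `needs_update(0, 60, update_requested=True, now=30)` reports an actively triggered update before the period has run out.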

Step S320, if the data validity meets the update condition, acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data, where the task data table at least includes new data and historical data that change currently, the historical data is the task data currently stored in the data storage device, the first check code includes N first sub-check codes, the first sub-check codes are calculated based on at least one row or at least one task data in the historical data, and N is greater than or equal to 1;

in this embodiment, the acquired task data refers to data on a user terminal or a server connected to a data storage device, and the data storage device may be understood as a backup database, a data buffer, a database in the server, or other storage devices, or may be a user terminal itself with a data saving function, where the user terminal may generate data in real time, and the data generated in real time may be retained by the terminal for a short time.

In practical applications, the acquired task data may be a data source acquired from a user terminal connected to the data storage device, and the historical data and the first check code of the historical data may be acquired from the data storage device itself, and at this time, for the data storage device, simultaneous storage of the data and the corresponding check code is required each time the data is synchronized.

In this embodiment, the first check code may be specifically understood as a set including a plurality of sub-check codes, and each sub-check code corresponds to each row of data or each task data in the data, that is, each sub-verification code is calculated based on each row of data or each task data. Of course, in order to reduce the number of check codes, multiple lines of data or multiple task data may be used for calculation, which may further reduce the amount of data calculation and also reduce the power consumption of the data storage device.
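The option of computing one sub-check code over several rows can be sketched as follows. The group size, the use of MD5, the treatment of each row as a string and the name `chunked_sub_check_codes` are assumptions for illustration.

```python
import hashlib


def chunked_sub_check_codes(rows, rows_per_code):
    """Return one MD5 sub-check code per group of `rows_per_code` consecutive rows."""
    if rows_per_code < 1:
        raise ValueError("rows_per_code must be at least 1")
    codes = []
    for start in range(0, len(rows), rows_per_code):
        digest = hashlib.md5()
        for row in rows[start:start + rows_per_code]:
            digest.update(row.encode("utf-8"))
            digest.update(b"\n")  # row terminator so that row boundaries affect the digest
        codes.append(digest.hexdigest())
    return codes


rows = ["a", "b", "c", "d", "e"]
per_row = chunked_sub_check_codes(rows, 1)   # one code per row
per_pair = chunked_sub_check_codes(rows, 2)  # one code per two rows
```

With larger groups fewer codes are stored and compared, but a mismatch then only identifies the group, so in such a sketch the whole group would have to be treated as changed.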

In the present embodiment, the first sub-check code may be calculated with any of several algorithms, such as a Hamming code check algorithm or a hash algorithm. The hash algorithms include MD2, MD4 and MD5; the hash value that such an algorithm produces, for example for use in a digital signature, is called a message digest. There is also the Secure Hash Algorithm (SHA), a standard algorithm that generates a larger (160-bit) message digest and is somewhat similar to the MD4 algorithm.
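For a concrete sense of digest sizes, the short sketch below uses Python's standard `hashlib` module, in which MD5 yields a 128-bit digest and SHA-1 a 160-bit digest. The sample message is hypothetical, and MD2 and MD4 are left out because `hashlib` does not guarantee that they are available.

```python
import hashlib

message = b"task-1,open"  # hypothetical row content

md5_digest = hashlib.md5(message).digest()
sha1_digest = hashlib.sha1(message).digest()

# digest() returns raw bytes, so the size in bits is the length times 8.
md5_bits = len(md5_digest) * 8
sha1_bits = len(sha1_digest) * 8
```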

Step S330, calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;

step S340, comparing whether the second check code is consistent with the first check code, to determine whether the task data currently stored in the data storage device is abnormal;

In this embodiment, the second check code is likewise a set including a plurality of sub-check codes. To ensure that the check codes are correct and comparable, the second check code is calculated with the same calculation algorithm as was used for the sub-check codes in the first check code: if a hash algorithm was used there, a hash algorithm is used here, and if a Hamming code check algorithm was used there, the corresponding Hamming code check algorithm is used here.

In this embodiment, calculating the second sub-check codes in the second check code also requires the data partitioning rule that was used when the first sub-check codes were calculated. That is, when calculating the first sub-check codes, the device partitions the historical data into small sets of rows or small sets of task data and then calculates a first sub-check code for each small set with the corresponding check algorithm. Before the second sub-check codes are calculated, the partitioning rule that the device applied to the historical data is determined, the current data is partitioned according to that rule, and the second sub-check codes are then calculated with the same check algorithm.

In this case, if the current data and the historical data are the same, every calculated second sub-check code matches its corresponding first sub-check code; if any of them does not match, the data are determined to be different.

Step S350, if the abnormal task data exist, screening the abnormal task data from the task data table, keeping the historical data, and synchronizing the abnormal task data into the data storage device, wherein the abnormal task data are incremental task data;

step S360, if no abnormity exists, the first check code is replaced by the second check code, and the historical data and the first check code in the task data table are deleted.

In this embodiment, an abnormality specifically means that the second check code is not equal to the first check code. In that case, the current data and the historical data are compared with each other, the part that differs is determined, and finally the differing part is synchronized into the data storage device.

In practical applications, the differing data covers the following two cases. In the first, existing data has changed: the corresponding part of the historical data is replaced by the corresponding part of the current data, so that only that part of the data needs to be synchronized while the rest is retained, which reduces the amount of data synchronized and improves synchronization efficiency. In the second, data has been newly added: the newly added data is synchronized directly into the device, that is, it is appended to the data originally stored by the data storage device.
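The two cases can be illustrated with a small sketch in which the stored data is a dictionary keyed by a row identifier. The keyed layout and the name `apply_increment` are assumptions; the sketch only shows that changed rows are replaced, new rows are added, and all other stored rows are left untouched.

```python
def apply_increment(stored, increment):
    """Apply incremental task data to the stored data in place.

    Returns (replaced_keys, added_keys) so the caller can see which case
    each incremental row fell into.
    """
    replaced, added = [], []
    for key, row in increment.items():
        if key in stored:
            replaced.append(key)  # case 1: existing data has changed
        else:
            added.append(key)     # case 2: newly added data
        stored[key] = row
    return replaced, added


# Hypothetical stored data and incremental task data.
stored = {"task-1": "open", "task-2": "open"}
increment = {"task-2": "closed", "task-3": "open"}
replaced, added = apply_increment(stored, increment)
```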

In this embodiment, data is synchronized through an incremental scheme: only the incremental part of the updated data needs to be synchronized, which greatly shortens the data synchronization time and improves resource utilization.

In this embodiment, the algorithm for calculating the check code may be an existing symmetric algorithm, an asymmetric algorithm, or another algorithm, as long as the algorithm can perform an encryption calculation on the data such that different data can be distinguished by comparing the results; the Hamming code check algorithm and the hash algorithm are preferred.

If a hash algorithm is adopted, the step of calculating the second check code of the task data in the task data table by using the same calculation algorithm for calculating the first check code comprises the following specific implementation steps:

calculating an integral check value of the task data in the task data table according to the Hash algorithm;

In this embodiment, the following two calculation schemes may be specifically adopted in the process of calculating the second check code of the task data table, and specifically, the two schemes are distinguished from the viewpoint of dividing the task data into small sets:

the first is a check code which is calculated by taking a current task data table as a whole;

the second is a check code calculated by partial data in the task data table;

The check code of the first calculation scheme is mainly used to judge whether the current data differs from the historical data; if it does, the check codes of the second calculation scheme are used to determine which partial data in the task data table differs.

In this embodiment, calculating the check codes of the second calculation scheme also requires a partitioning rule for the task data in the task data table. That is, the way the historical data was partitioned when the first sub-check codes were calculated must be determined, for example row by row, with one check code per row of data, or task by task, with one check code per task. Once the partitioning of the historical data has been determined, the same partitioning is applied to the task data table, and the second sub-check codes are then calculated.

At this time, the step of comparing whether the second check code is consistent with the first check code to determine whether the task data currently stored in the data storage device is abnormal includes:

comparing whether the overall check value is the same as the first check value;

In this embodiment, when the data storage device judges whether the task data is abnormal, it is in fact identifying whether new data has been added to the data to be synchronized, and it does so by means of the overall check code of the data. Here, new data may be understood either as newly added data or as original historical data that has been changed. In either case, if the two check values are not equal, it is determined that new data exists, and the process jumps to step S350.
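The overall comparison can be sketched in Python as follows. This is only an illustration under assumptions: rows are tuples of strings, the unit-separator and newline delimiters are arbitrary choices, and the names `overall_check_value` and `has_new_data` are hypothetical.

```python
import hashlib


def overall_check_value(rows):
    """Hash a whole table (an ordered list of string tuples) into one value."""
    digest = hashlib.md5()
    for row in rows:
        digest.update("\x1f".join(row).encode("utf-8"))  # field delimiter
        digest.update(b"\n")                             # row delimiter
    return digest.hexdigest()


def has_new_data(first_check_value, task_rows):
    """True when the task data table no longer matches the historical value."""
    return overall_check_value(task_rows) != first_check_value
```

Both an appended row and a modified row change the overall value, which matches the two meanings of "new data" described above.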

In this embodiment, after the step of determining that the current data has the incremental task data, the method further includes:

comparing each MD5 code in the second check code one by one with the corresponding first sub-check code in the first check code, and screening out, on the basis of the comparison result, the MD5 codes in the second check code that differ from the first sub-check codes in the first check code.

In this embodiment, besides judging as a whole that the data is abnormal and then comparing the data directly to screen out the differing data for synchronization, the differing data can also be screened by means of the sub-check codes of the data. In practical application, when the overall check codes of the historical data and of the current data of the data storage device are calculated, the data is further subdivided, for example formatted according to its storage format, and a sub-check code is calculated for each row of data with the same check algorithm used for the overall check code. After the data is judged to contain new data, the sub-check codes of the current data (the second sub-check codes) are compared with the sub-check codes of the historical data (the first sub-check codes), and data is extracted according to the comparison result to obtain the differing partial data. Finally, synchronization is performed on this partial data only, which greatly reduces the amount of data synchronized, reduces resource consumption and improves the efficiency of data synchronization.
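A minimal sketch of the one-by-one comparison of sub-check codes is given below. It assumes that both sets of sub-check codes are held as dictionaries keyed by the same partition key (for example a primary key); the function name `differing_keys` is hypothetical.

```python
def differing_keys(first_sub_codes, second_sub_codes):
    """Return the keys whose second sub-check code has no equal first sub-check code.

    A key that is missing from first_sub_codes (newly added data) and a key
    whose code changed (modified data) are both reported.
    """
    return [key for key, code in second_sub_codes.items()
            if first_sub_codes.get(key) != code]
```

For example, with first codes `{"k1": "aa", "k2": "bb"}` and second codes `{"k1": "aa", "k2": "cc", "k3": "dd"}`, the function reports `k2` (changed) and `k3` (added), and only those entries would need to be extracted for synchronization.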

In practical applications, the MD5 algorithm among the hash algorithms is preferably used as the calculation algorithm, and the calculated check code is then an MD5 code. To ensure the accuracy of data comparison, the embodiment of the present invention preferably divides the data by task: for example, when the obtained current data contains a plurality of task data, each task data is treated as a small data set and its check code is calculated to obtain the sub-check code of that task data, so that the integrity of the data is preserved and errors in calculating the sub-check codes are reduced.

Of course, if the data volume is small, the data can be divided directly by row, so that differences can be located precisely down to each row of data, which reduces the amount of data synchronized and improves data synchronization efficiency.

In this embodiment, the steps of screening out abnormal task data from the current data, retaining the historical data, and synchronizing the abnormal task data to the data storage device include:

extracting the corresponding incremental task data from the current data according to the screened differing MD5 codes, inserting the incremental task data into a preset incremental data table, setting a distinguishing field as an attribute of the inserted incremental task data, and recording the insertion time in the distinguishing field;

and retaining the task data currently stored in the data storage device, selecting from the incremental data table the incremental task data whose insertion time is closest to the current time, and synchronizing that incremental task data into the data storage device.

In this embodiment, to facilitate a fast data synchronization operation, after the sub-check codes in the current data that differ from those of the historical data have been identified, the corresponding data is extracted according to those sub-check codes and stored in an incremental data table. The incremental data represents the new data added to the current data when the data is updated, and the incremental data table is the table called when data synchronization is performed.
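The incremental data table and the selection by insertion time might be modelled as in the following Python sketch. The list-of-dictionaries table, the `insert_day` field name and both function names are assumptions chosen for illustration.

```python
from datetime import date


def insert_increment(increment_table, rows, insert_day):
    """Append rows to the increment table, tagging each with its insertion day."""
    for row in rows:
        increment_table.append({**row, "insert_day": insert_day})


def select_for_sync(increment_table, today):
    """Return the rows whose insertion day is closest to, and not after, today."""
    days = [r["insert_day"] for r in increment_table if r["insert_day"] <= today]
    if not days:
        return []
    nearest = max(days)
    return [r for r in increment_table if r["insert_day"] == nearest]
```

Only the rows returned by `select_for_sync` would be written to the data storage device; the task data already stored there is left untouched.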

Further, after the steps of screening abnormal task data from the current data according to the abnormality, retaining the historical data, and synchronizing the abnormal task data to the data storage device, the method further includes:

In the data synchronization method provided by the embodiment of the invention, when the data needs to be updated and synchronized, the incremental task data by which the data to be synchronized differs from the historical data is determined by comparing the check code of the data to be synchronized with the check code of the historical data, and only the incremental task data is then synchronized to the storage device.

The data synchronization method provided by the present embodiment is described below with a data warehouse infrastructure established on Hadoop as an execution subject of the method. The data warehouse infrastructure provides a set of tools that can be used to perform data Extraction Transformation Loading (ETL), a mechanism that can store, query, and analyze large-scale data stored in Hadoop, and also defines a simple SQL-like query language, called HQL, that allows SQL-familiar users to query data, as shown in particular in fig. 4:

in this embodiment, the data synchronization operations exchange data with one another in the form of data tables. For example, the historical data and the current data exist as data tables: after the historical data and the current data are obtained, they form a full data table; a table is also established for the incremental data; and the check codes are likewise stored in a table. The specific process is as shown in the following figure:

in step S40, the task is the original Hive full-table batch task.

Step S41: the task is the Hive increment table batch task. The increment table retains the incremental data of the most recent 10 days. The main processing logic is to calculate the difference data between the history table and the latest data table of the current day and insert the difference data into the incremental Hive table. The increment table has a partition field that records the day of insertion. Finally, historical incremental data older than 10 days is deleted.

Step S42: after the task in step S41 is completed, the incremental derivative (data export) task is started, and the small amount of changed data is updated to the real-time query system according to the incremental Hive table.

In step S43, the task computes an MD5 code over all fields of each row in the Hive full table and inserts the MD5 codes into the Hive MD5 table, for use by the next run of the Hive increment table task of step S41.
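The processing logic of step S41 can be sketched in plain Python, with in-memory structures standing in for the Hive tables. Everything here is an assumption made for illustration: the dictionary-based tables, the names `row_md5` and `run_increment_task`, and the choice to treat a partition as expired once it is more than 10 days older than the current day.

```python
import hashlib
from datetime import date, timedelta

RETENTION_DAYS = 10  # retention window stated in step S41


def row_md5(row):
    """MD5 over all non-key fields of a row, joined with '&'; None becomes ''."""
    fields = [row[k] or "" for k in sorted(row) if k != "primary_key"]
    return hashlib.md5("&".join(fields).encode("utf-8")).hexdigest()


def run_increment_task(full_table, md5_table, increment_table, today):
    """Insert today's difference data and drop expired partitions.

    full_table: list of row dicts, each with a "primary_key" entry.
    md5_table: dict mapping primary key to the previous day's MD5 code.
    increment_table: dict mapping insertion day to a list of rows.
    """
    changed = [row for row in full_table
               if md5_table.get(row["primary_key"]) != row_md5(row)]
    if changed:
        increment_table[today] = changed  # partition field = day of insertion
    cutoff = today - timedelta(days=RETENTION_DAYS)
    for day in [d for d in increment_table if d < cutoff]:
        del increment_table[day]
    return changed
```

Step S43 would then recompute `row_md5` for every row of the full table and store the results as the new `md5_table` for the next day's run.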

The procedures of steps S40-S43 are respectively represented in the form of a table, for example, as follows:

the current data to be synchronized is taken as the task data table, the corresponding MD5 codes are calculated at the same time, and the data and the MD5 codes are stored in their respective tables, as shown in the following two tables:

Table A_full (full table)
{
  PRIMARY_KEY STRING COMMENT 'physical primary key'
  COL1 STRING COMMENT 'field 1'
  COL2 STRING COMMENT 'field 2'
}

Table A_MD5
{
  PRIMARY_KEY STRING COMMENT 'physical primary key'
  MD5 STRING COMMENT 'full-field MD5 code'
}

After the check codes of the task data table and of the historical data are compared, the identified incremental data is collected into an increment table; during synchronization, only this table needs to be called, and the data is selected and synchronized in time order, as shown in the following table:

Table A_Delta
{
  PRIMARY_KEY STRING COMMENT 'physical primary key'
  COL1 STRING COMMENT 'field 1'
  COL2 STRING COMMENT 'field 2'
  PARTITION STRING COMMENT 'partition field'
}

The following description takes a 10-day update window as an example:

1) Hive full-table task: responsible for performing the calculation and statistics on the full data.

2) Hive increment table task:

Table A_full (the full table) is used as the base table and is joined to table A_MD5 on the primary key. Table A_MD5 records, for the previous day, the MD5 code of the CONCAT of all fields of table A_full. After the join, the MD5 code of the latest day's data in table A_full is compared with the MD5 code of the previous day; rows whose codes are not equal are the data newly added on the current day and are inserted into table A_Delta (the increment table), with the partition field set to the current day.

Finally, incremental historical data older than 10 days is deleted.

3) Hive full-table MD5 task:

This task generates MD5 codes from the Hive full-table data, for use by the next day's incremental task calculation.

The MD5 code is calculated as follows: MD5(CONCAT(NVL(COL1, ''), '&', NVL(COL2, '')))

4) Incremental derivative task: this task obtains the data of the current day from table A_Delta and synchronizes the small amount of incremental data to the real-time system. Because the increment table retains the incremental data of the last 10 days, in an extreme case where the incremental derivative task cannot complete the export on the current day, a catch-up operation can be performed on the next day by means of the partition field.
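The catch-up operation through the partition field can be illustrated with the short Python sketch below. The bookkeeping value `last_exported`, the `push` callback standing in for the real-time system, and both function names are hypothetical.

```python
from datetime import date


def partitions_to_export(increment_table, last_exported, today):
    """Return the partition days after last_exported, up to today, oldest first."""
    return sorted(day for day in increment_table if last_exported < day <= today)


def export_increment(increment_table, last_exported, today, push):
    """Push every pending partition to the real-time system via push(row).

    increment_table: dict mapping partition day to a list of rows.
    Returns the day of the last partition exported.
    """
    for day in partitions_to_export(increment_table, last_exported, today):
        for row in increment_table[day]:
            push(row)
        last_exported = day
    return last_exported
```

If the export of one day is missed, the next call picks up both that day's partition and the current one, provided the missed partition is still inside the retention window.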

Further, in this embodiment, the task data table, the check table and the increment table are associated with one another through the physical primary key column set in each table, so that after the differing sub-check codes are determined, the corresponding table and data can be called directly to perform the data synchronization operation.

Because the method synchronizes only the newly added data, the data synchronization time is greatly shortened and the utilization rate of resources is improved.

In the actual task scheduling design, the Hive full-table MD5 task is pushed to the last position in the run order. The incremental derivative task does not need to depend on the Hive full-table MD5 task and can be started as soon as the Hive increment table task has finished executing, which reduces task waiting time.
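The dependency relationships described above can be written down as a small graph and ordered with Python's standard `graphlib` module (Python 3.9 or later). The task names and the exact edges are assumptions inferred from the description, in particular that the MD5 task waits for the increment table task so that the previous day's MD5 codes are not overwritten before they are compared.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
DEPENDENCIES = {
    "hive_full_table": set(),
    "hive_increment_table": {"hive_full_table"},
    "increment_export": {"hive_increment_table"},
    "hive_full_table_md5": {"hive_increment_table"},
}


def execution_order(dependencies):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())
```

Because `increment_export` has no edge to `hive_full_table_md5`, a scheduler is free to start the export as soon as the increment table task finishes, without waiting for the MD5 task.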

Comparing MD5 codes avoids having to compare every field individually for differences in the Hive increment task, so execution performance and code readability are both higher.
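The full-field expression MD5(CONCAT(NVL(COL1, ''), '&', NVL(COL2, ''))) can be mirrored in Python to show how one code per row replaces a field-by-field comparison. This is a sketch for two string columns only; the helper names `nvl` and `full_field_md5` are hypothetical.

```python
import hashlib


def nvl(value, default=""):
    """Return default when value is None, like the NVL function in the expression."""
    return default if value is None else value


def full_field_md5(col1, col2):
    """MD5 over both columns joined with '&', with None treated as ''."""
    joined = nvl(col1) + "&" + nvl(col2)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

One caveat of this kind of concatenation is that a field value containing the separator can make two different rows produce the same string, for example `("a&", "b")` and `("a", "&b")`; a separator that cannot occur in the data avoids this.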

The Hive increment table retains the incremental historical data of the last 10 days, which guards against the risk that the derivative task cannot finish on the same day and the data is then overwritten by the next day's run.

In order to solve the above problem, an embodiment of the present invention further provides a data synchronization apparatus, and referring to fig. 5, fig. 5 is a schematic diagram of functional modules of the data synchronization apparatus provided in the embodiment of the present invention. In this embodiment, the apparatus comprises:

a detecting module 51, configured to detect whether data validity of the data storage device meets an update condition, where the update condition is that the data validity has reached an expiration time;

a data acquisition module 52, configured to acquire, when the data validity meets an update condition, a task data table to be synchronized to the data storage device at a current time and a first check code of task data in historical data, where the task data table at least includes new data and historical data that change currently, the historical data is the task data currently stored in the data storage device, the first check code includes N first sub-check codes, the first sub-check code is calculated based on at least one row of data or at least one task data in the historical data, and N is greater than or equal to 1;

a calculating module 53, configured to calculate a second check code of the task data in the task data table by using a same calculation algorithm for calculating the first check code, where the second check code includes M second sub-check codes, the second sub-check code is calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;

a comparing module 54, configured to compare whether the second check code is consistent with the first check code, so as to determine whether the task data currently stored in the data storage device is abnormal;

a synchronization module 55, configured to, when there is an anomaly, screen out anomalous task data from the task data table, retain the historical data, and synchronize the anomalous task data to the data storage device, where the anomalous task data is incremental task data; and replacing the first check code with the second check code and deleting the historical data and the first check code in the task data table when no abnormity exists.
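How the five modules might cooperate is sketched below in Python. This is an illustrative assumption, not the apparatus itself: the `DataStore` class, the `check_code` and `synchronize` names, the row layout (tuples of strings keyed by primary key) and the decision to refresh the stored check codes after every run are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime


def check_code(row):
    """Sub-check code of one row (a tuple of strings)."""
    return hashlib.md5("&".join(row).encode("utf-8")).hexdigest()


@dataclass
class DataStore:
    expires_at: datetime                        # end of data validity
    rows: dict = field(default_factory=dict)    # primary key -> historical row
    codes: dict = field(default_factory=dict)   # primary key -> first sub-check code


def synchronize(store, task_table, now):
    """Run detection, calculation, comparison and synchronization once.

    task_table: dict mapping primary key to row, as supplied by data acquisition.
    Returns the primary keys that were synchronized.
    """
    if now < store.expires_at:                  # detecting module 51
        return []
    second = {key: check_code(row) for key, row in task_table.items()}  # module 53
    changed = [key for key, code in second.items()
               if store.codes.get(key) != code]  # comparing module 54
    for key in changed:                         # synchronization module 55
        store.rows[key] = task_table[key]
    store.codes = second                        # second codes replace the first
    return changed
```

A second call with the same task data table returns an empty list, because the stored check codes already match.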

Since the data synchronization apparatus is based on the same embodiment as the data synchronization method described above, the details of the apparatus embodiment are not repeated here.

The invention also provides a computer readable storage medium.

In this embodiment, the computer readable storage medium has a data synchronization program stored thereon, and the data synchronization program, when executed by a processor, implements the steps of the data synchronization method as described in any one of the above embodiments. The method implemented by the data synchronization program when executed by the processor may refer to various embodiments of the data synchronization method of the present invention, and therefore, redundant description is not repeated.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims

1. A data synchronization method applied to a data storage device is characterized by comprising the following steps:

2. The data synchronization method of claim 1, wherein the computational algorithm comprises a hamming code check algorithm and a hash algorithm.

3. The data synchronization method of claim 2, wherein the step of calculating the second check code of the task data in the task data table using the same calculation algorithm as that used to calculate the first check code comprises:

4. The data synchronization method of claim 3, wherein the step of comparing whether the second check code is consistent with the first check code to determine whether the task data currently stored in the data storage device is abnormal comprises:

comparing whether the overall check value is the same as the first check value;

5. The data synchronization method of claim 4, wherein after the step of determining that the incremental task data exists in the task data table, further comprising:

6. The data synchronization method of claim 5, wherein the step of screening the task data table for anomalous task data, and retaining the historical data and synchronizing the anomalous task data to the data storage device comprises:

7. The data synchronization method of claim 6, further comprising, after the steps of screening the task data table for anomalous task data, preserving the historical data, and synchronizing the anomalous task data to the data storage device:

8. A data synchronization apparatus, characterized in that the data synchronization apparatus comprises:

9. A data synchronization apparatus, characterized in that the data synchronization apparatus comprises: memory, a processor and a data synchronization program stored on the memory and executable on the processor, the data synchronization program when executed by the processor implementing the steps of the data synchronization method of any of claims 1-7.

10. A computer-readable storage medium, having stored thereon a data synchronization program, which when executed by a processor implements the steps of the data synchronization method of any one of claims 1-7.